# Lab 2 (PDF chatbot)

*A demo of a question-answering chatbot that combines FLAN-T5 with Retrieval-Augmented Generation (RAG) to answer questions grounded in the content of PDF documents.*

**Methodology**:

1. Extracting text content from PDF files
2. Chunking the text into smaller, manageable pieces with overlap
3. Converting chunks into embeddings using sentence-transformers
4. Storing embeddings in a FAISS vector database for fast retrieval
5. Retrieving the most relevant chunks based on the user's question
6. Providing FLAN-T5 with the retrieved context to generate accurate answers

---

## 1. Kaggle Setup

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/tips-hindawi-university/Tips Hindawi University Info.pdf


---

## 2. Installing Required Packages

In [2]:
!pip install transformers==4.52.4

Collecting transformers==4.52.4
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers==4.52.4)
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Downloading transformers-4.52.4-py3-none-any.whl (10.5 MB)
Downloading huggingface_hub-0.36.0-py3-none-any.whl (566 kB)
Installing collected packages: huggingface-hub, transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 1.0.0rc2
    Uninstalling huggingface-hub-1.0.0rc2:
      Successfully uninstalled huggingface-hub-1.0.0rc2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.53.3
    Uninstalling transfo

In [3]:
!pip install sentence-transformers PyPDF2 faiss-cpu

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>

---

## 3. Importing Packages

In [4]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [5]:
import faiss
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer



--- 

## 4. Helper Functions

In [50]:
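# Generation helper for FLAN-T5; relies on the global `tokenizer` and `model` loaded in section 5
def generate_text_t5(prompt, max_length=512):
    # Tokenize the prompt; anything beyond 512 input tokens is truncated
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    # Move the input tensors to the device the model lives on
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Sample with a low temperature and nucleus sampling (top_p) for mostly stable answers
    outputs = model.generate(
        **inputs,
        max_length=max_length,  # upper bound on the generated sequence length
        temperature=0.3,
        do_sample=True,
        top_p=0.9
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)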

In [51]:
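def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    # extract_text() can yield nothing for image-only pages, so fall back to ""
    page_texts = [(page.extract_text() or "") for page in reader.pages]
    # One newline after every page, as before
    return "".join(text + "\n" for text in page_texts)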

In [52]:
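def chunk_text(text, chunk_size=500, overlap=50):
    # The window must advance, so the overlap has to be smaller than the chunk
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        # Stop once a window reaches the end; a further chunk would only repeat the tail
        if i + chunk_size >= len(words):
            break
    return chunks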
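
To make the overlap concrete, here is a small worked example on a toy input. It uses a simplified copy of the sliding-window logic so that the cell runs on its own; the 12-word input and the sizes (`chunk_size=5`, `overlap=2`) are illustrative assumptions, not values from the lab. Each window starts `chunk_size - overlap` words after the previous one, so neighbouring chunks share `overlap` words.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    # Simplified copy of the sliding-window chunker, repeated so this cell runs alone.
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Toy input (an assumption for illustration): 12 numbered "words".
text = " ".join(f"w{n}" for n in range(12))
chunks = chunk_text(text, chunk_size=5, overlap=2)

for chunk in chunks:
    print(chunk)

# The last 2 words of one chunk are the first 2 words of the next.
shared = chunks[0].split()[-2:]
print("shared between chunk 1 and chunk 2:", shared)
```

With these sizes the windows start at words 0, 3, 6 and 9, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.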

In [53]:
def embed_chunks(chunks, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    model = SentenceTransformer(model_name)
    embeddings = model.encode(chunks, convert_to_numpy=True)
    return model, embeddings

In [54]:
def create_faiss_index(embeddings):
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    return index

In [55]:
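def search_index(query, model, index, chunks, k=5):
    query_embedding = model.encode([query], convert_to_numpy=True)
    # Never ask for more neighbours than there are chunks
    k = min(k, len(chunks))
    distances, indices = index.search(query_embedding, k)
    # FAISS pads missing results with -1, which would otherwise index the last chunk
    return [chunks[i] for i in indices[0] if i != -1]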
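
For intuition about what the index lookup computes: `faiss.IndexFlatL2` is an exact (exhaustive) index, and its `search` call reports squared Euclidean distances with the closest vectors first. The sketch below reproduces that idea in plain Python on made-up 2-D vectors. It is not how FAISS is implemented, and `squared_l2` and `brute_force_search` are hypothetical names introduced only for this illustration.

```python
import heapq

def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(query, vectors, k):
    """Return (distances, indices) of the k stored vectors closest to the query."""
    k = min(k, len(vectors))
    # Score every stored vector against the query: an exhaustive scan.
    scored = [(squared_l2(query, v), i) for i, v in enumerate(vectors)]
    best = heapq.nsmallest(k, scored)  # smallest distance first
    return [d for d, _ in best], [i for _, i in best]

# Made-up "embeddings" (an assumption for illustration only).
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = brute_force_search([0.9, 0.1], vectors, k=2)
print(indices, distances)
```

An exhaustive scan costs time proportional to the number of stored vectors, which is unproblematic for the handful of chunks a single PDF produces.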

---

## 5. Trying It Out

In [56]:
# Load FLAN-T5 model
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")



In [57]:
# Load the PDF and split it into chunks
pdf_path = "/kaggle/input/tips-hindawi-university/Tips Hindawi University Info.pdf"

text = extract_text_from_pdf(pdf_path)
chunks = chunk_text(text, chunk_size=150, overlap=30)

In [58]:
# Embedding text chunks
model_embeddings, embeddings = embed_chunks(chunks)


In [59]:
# Creating the vector database
index = create_faiss_index(embeddings)

In [80]:
question = "How many residence halls are on the main campus?"
top_chunks = search_index(question, model_embeddings, index, chunks, k=3)

for i, chunk in enumerate(top_chunks, 1):
    print(f"\n--- Chunk {i} ---\n{chunk}")



--- Chunk 1 ---
It houses: - 12 academic buildings - Central library with over 1 million volumes - Four residence halls - A medical center - Sports complex and stadium - Innovation and entrepreneurship hub 2.2 Satellite Campuses North Campus : Focused on agricultural sciences and environmental research West Campus : Home to the School of Design and Architecture 2.3 Student Housing Al-Nour Dormitory (First-year undergraduates) Rafidain Apartments (Graduate students) Amal Housing Complex (Family and international students) 3. Academic Structure 3.1 Faculties Faculty of Engineering Faculty of Medicine and Health Sciences Faculty of Business and Economics Faculty of Arts and Humanities Faculty of Law and International Studies Faculty of Computer and Information Sciences Faculty of Architecture and Design• • • • • • • • • • • • 1 3.2 Sample Undergraduate Programs BSc in Computer Science BA in International Relations BBA in Marketing BEng in Civil Engineering BSc in Nursing 3.3

--- Chunk 2

In [81]:
# Keep the question and the fallback instruction first so input truncation cannot cut them off
context = "\n\n".join(top_chunks)
prompt = f"""Answer the next question: {question}
If the answer is not in the paragraphs below, say that it cannot be found in them.
Paragraphs:
{context}
"""
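
Because `generate_text_t5` truncates its input to 512 tokens, a long retrieved context can push the end of the prompt out of the model's view. One possible way to guard against that is sketched below: put the question first and append whole chunks only while a size budget allows. The helper name `build_prompt`, the `max_words` parameter, and the use of word counts as a rough stand-in for token counts are all assumptions made for this illustration; an exact check would count tokens with the model's tokenizer.

```python
def build_prompt(question, chunks, max_words=300):
    """Put the question first, then add whole chunks until the word budget is spent."""
    header = (
        f"Answer the next question: {question}\n"
        "If the answer is not in the paragraphs below, say that it cannot be found.\n"
        "Paragraphs:\n"
    )
    used = len(header.split())
    kept = []
    for chunk in chunks:
        n_words = len(chunk.split())
        if used + n_words > max_words:
            break  # chunks arrive ranked best-first, so drop the tail
        kept.append(chunk)
        used += n_words
    return header + "\n\n".join(kept)

# Dummy chunks (illustrative only): two short ones and one that exceeds the budget.
demo_chunks = ["alpha " * 100, "beta " * 100, "gamma " * 200]
demo_prompt = build_prompt("How many residence halls are there?", demo_chunks, max_words=300)
print(len(demo_prompt.split()), "words in the prompt")
```

Dropping whole chunks from the tail keeps the best-ranked passages intact instead of cutting one of them mid-sentence.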

In [82]:
answer = generate_text_t5(prompt, max_length=500)

In [83]:
print(answer)

Four residence halls
