# PDF Q&A Bot with FAISS + Sentence Transformers + FLAN-T5


> Upload a PDF and ask anything about it — get smart, contextual answers from the doc!

<details>
<summary><strong>How it works</strong></summary>

1. **Extract Text**: Uses PyMuPDF (`fitz`) to pull all text from the uploaded PDF.
2. **Split into Chunks**: Splits the text into sentences with the `nltk` sentence tokenizer, then packs whole sentences into chunks of up to about 250 characters.
3. **Embed Chunks**: Uses `all-MiniLM-L6-v2` from Sentence Transformers to embed each chunk into vector space.
4. **Index with FAISS**: Adds all embeddings to a FAISS index for efficient semantic search.
5. **Ask Questions**:
   - Input a question.
   - Retrieves the top-k most relevant chunks (k = 5 by default).
   - Combines context and feeds it to **FLAN-T5** (`google/flan-t5-base`) to generate an answer.

</details>

<details>
<summary><strong>Libraries Used</strong></summary>

- `sentence-transformers` for dense embeddings
- `faiss-cpu` for semantic search
- `PyMuPDF` for PDF parsing
- `transformers` for loading FLAN-T5
- `nltk` for sentence-level chunking

</details>

---

**Try asking things like:**
- *“What is the main conclusion of this document?”*  
- *“Summarize the methodology.”*  
- *“What are the key findings in this PDF?”*

# STEP 1: Install required libraries and upload your PDF.

In [None]:

!pip install -q faiss-cpu sentence-transformers transformers PyMuPDF nltk

In [None]:
from google.colab import files
import fitz  # PyMuPDF

uploaded = files.upload()
pdf_path = next(iter(uploaded))

# Extract text from the PDF and split it into chunks

In [None]:
def extract_text_from_pdf(path):
    # Open the PDF, concatenate the plain text of every page, then close the file.
    with fitz.open(path) as doc:
        return "".join(page.get_text() for page in doc)

raw_text = extract_text_from_pdf(pdf_path)
print("PDF loaded!")

PDF loaded!


In [None]:

from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

def split_into_chunks(text, max_len=250):
    # Greedily pack whole sentences into chunks of roughly max_len characters.
    sentences = sent_tokenize(text)
    chunks, chunk = [], ""
    for sentence in sentences:
        if len(chunk) + len(sentence) <= max_len:
            chunk += sentence + " "
        else:
            if chunk:  # avoid an empty chunk when the first sentence exceeds max_len
                chunks.append(chunk.strip())
            chunk = sentence + " "
    if chunk:
        chunks.append(chunk.strip())
    return chunks

chunks = split_into_chunks(raw_text)
print(f"Split into {len(chunks)} chunks")

Split into 567 chunks
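
To see how the greedy packing behaves on a small input, here is a minimal standalone sketch. It assumes a naive regex sentence splitter (`naive_sentences`, a hypothetical stand-in for `sent_tokenize` so the snippet runs without NLTK) and a hypothetical helper name, `pack_sentences`. The real tokenizer handles abbreviations and other edge cases that this regex does not.

```python
import re

def naive_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize:
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def pack_sentences(sentences, max_len=250):
    # Greedily pack whole sentences into chunks of roughly max_len characters.
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) <= max_len:
            current += sentence + " "
        else:
            if current:
                chunks.append(current.strip())
            current = sentence + " "
    if current:
        chunks.append(current.strip())
    return chunks

demo_text = (
    "FAISS indexes vectors. It supports exact search. "
    "Sentence Transformers produce embeddings."
)
demo_chunks = pack_sentences(naive_sentences(demo_text), max_len=50)
for c in demo_chunks:
    print(len(c), repr(c))
```

Note that `max_len` is a target rather than a hard cap: a single sentence longer than `max_len` still becomes its own oversized chunk.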




# Embed chunks with a Sentence Transformer and ask your questions

In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')
chunk_embeddings = embedder.encode(chunks, show_progress_bar=True)


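# Exact (brute-force) L2 index; FAISS expects float32 vectors.
chunk_embeddings = np.asarray(chunk_embeddings, dtype="float32")
index = faiss.IndexFlatL2(chunk_embeddings.shape[1])
index.add(chunk_embeddings)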


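def retrieve_context(question, k=5):
    # Embed the question and return the k nearest chunks, closest first.
    k = min(k, len(chunks))  # FAISS pads missing results with index -1
    q_emb = embedder.encode([question])
    _, indices = index.search(np.asarray(q_emb, dtype="float32"), k)
    return [chunks[i] for i in indices[0]]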
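
Conceptually, `IndexFlatL2` does an exact search: it compares the query with every stored vector and reports squared L2 distances. The sketch below reproduces that idea in plain Python on toy 2-D vectors. The names `squared_l2` and `brute_force_search` are hypothetical, and this is an illustration of the technique rather than how FAISS is implemented internally.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(vectors, query, k):
    # Score every stored vector against the query and keep the k closest.
    k = min(k, len(vectors))
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    distances = [d for d, _ in scored[:k]]
    indices = [i for _, i in scored[:k]]
    return distances, indices

# Toy "embeddings": the query sits closest to vector 1, then vector 0.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = brute_force_search(toy_vectors, [0.9, 0.1], k=2)
print(toy_indices, toy_distances)
```

Exact search scales linearly with the number of chunks, which is unlikely to matter for a single PDF with a few hundred chunks.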


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def answer_question(question, k=5):
    # Retrieve the k most relevant chunks and let FLAN-T5 answer from them.
    context = retrieve_context(question, k)
    prompt = f"Context: {' '.join(context)} \n\nQuestion: {question} \nAnswer:"
    # The prompt is truncated to 512 tokens; anything beyond that is dropped.
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
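
Because the question comes after the context in the prompt, cutting an over-long prompt at 512 tokens could drop the question itself. One possible safeguard, sketched below, is to reserve room for the question first and then add chunks until a budget is used up. `build_prompt` and its `count_tokens` hook are hypothetical and not part of the notebook. The default counter is a crude whitespace word count, so in the notebook you would likely pass something like `lambda s: len(tokenizer(s)["input_ids"])` instead.

```python
def build_prompt(question, context_chunks, budget=512, count_tokens=None):
    # count_tokens is a hypothetical hook; the default is a rough
    # whitespace word count, not a real tokenizer.
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())
    suffix = f"\n\nQuestion: {question}\nAnswer:"
    # Reserve space for the fixed parts of the prompt first.
    remaining = budget - count_tokens("Context:") - count_tokens(suffix)
    kept = []
    for chunk in context_chunks:  # chunks arrive most relevant first
        cost = count_tokens(chunk)
        if cost > remaining:
            break
        kept.append(chunk)
        remaining -= cost
    return "Context: " + " ".join(kept) + suffix

demo_prompt = build_prompt(
    "What is FAISS?",
    [
        "FAISS is a library for similarity search.",
        "It was built for dense vectors.",
        "Unrelated filler text here.",
    ],
    budget=20,
)
print(demo_prompt)
```

With the default five chunks of about 250 characters each, the prompt will usually fit anyway. The budget starts to matter if you raise `k` or `max_len`.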

In [None]:
while True:
    q = input("Ask a question (or type 'exit'): ").strip()
    if q.lower() == 'exit':
        break
    if not q:
        continue  # ignore empty input
    print("Answer:", answer_question(q))