# Overview
Goal: Build a web-based chatbot that:

* Lets users upload 75+ documents (PDFs, images).

* Extracts and stores content in a vector database.

* Allows natural language queries with citation-level answers.

* Identifies and synthesizes cross-document themes.



### We need to install essential libraries for:

* OCR (Tesseract)

* PDF/text extraction

* Vector storage & semantic search

* LLM interaction (OpenAI, etc.)

In [1]:
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [2]:

!pip install pytesseract pdf2image PyMuPDF openai langchain chromadb tiktoken unstructured
!apt-get install -y poppler-utils tesseract-ocr




'apt-get' is not recognized as an internal or external command,
operable program or batch file.
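
The error above appears because `apt-get` is the package manager of Debian/Ubuntu-style Linux systems and does not exist on Windows. There, Tesseract and Poppler have to be installed separately and added to `PATH`. A minimal sketch for checking whether both command-line tools are reachable, assuming the usual executable names `tesseract` and `pdftoppm` (the helper name `find_ocr_tools` is hypothetical, not part of any library):

```python
import shutil

def find_ocr_tools(names=("tesseract", "pdftoppm")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Report which of the external tools this Python process can see
for tool, path in find_ocr_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as missing, `pytesseract` and `pdf2image` will likely fail later even though the `pip install` step succeeded, because both packages only wrap these external programs.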


#### Installed Tools:
* pytesseract – OCR for scanned documents.

* pdf2image – Convert PDF pages to images.

* PyMuPDF (fitz) – Text-based PDF extraction.

* openai – For interacting with GPT models.

* langchain – LLM orchestration.

* chromadb – Vector DB for semantic search.

* tiktoken – Tokenizer for OpenAI models.

* unstructured – Parsing of mixed document formats.

In [3]:

!pip install pymupdf pytesseract pdf2image pillow
!apt-get install poppler-utils tesseract-ocr 




'apt-get' is not recognized as an internal or external command,
operable program or batch file.


In [9]:
# Install dependencies (uncomment if running for the first time)
# !pip install pymupdf

import fitz  # PyMuPDF
from pathlib import Path

def extract_text_from_pdf(pdf_path):
    """
    Extracts text from each page of a PDF using PyMuPDF (fitz).
    
    Parameters:
        pdf_path (str or Path): File name or path of the PDF in the same directory.

    Returns:
        dict: Dictionary with keys as 'Page_1', 'Page_2', etc., and values as text content.
    """
    doc = fitz.open(pdf_path)
    text_data = {}

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text()
        text_data[f"Page_{page_num + 1}"] = text.strip()

    doc.close()
    return text_data


In [10]:
# Define file name (must match exactly, case-sensitive)
pdf_path = "Resume_Kaameshwar_pdf.pdf"

# Extract text from the PDF
resume_text = extract_text_from_pdf(pdf_path)

# Preview first page content
print("Preview of Page 1:\n")
print(resume_text['Page_1'][:500])


Preview of Page 1:

Kaameshwar Rai
Bengaluru, KA - India | kaameshwarrai@gmail.com | +91-9910713109 | LinkedIn | Github | Kaggle
PROFILE SUMMARY
●
Analytical individual with excellent communication, programming expertise, and strong aptitude. Proven
leadership abilities demonstrated by effectively leading project groups during undergraduate and diploma
courses, resulting in the successful completion of projects.
KEY SKILLS
●
Tech Skills: Proficient in Python, SQL, Tableau, and an array of Python libraries including


## The preview shows that the PyMuPDF text extraction logic works correctly: it extracts the text from the resume PDF.

## What We’ve Achieved So Far:
* Successfully extracted clean text from a text-based PDF

* Verified the Jupyter-compatible file path

* Printed the first 500 characters to validate content
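
This check covers text-based PDFs only. For pages that consist solely of a scanned image, `get_text()` typically returns little or no text, so those pages would need the OCR route instead. A minimal sketch for spotting such pages in the dictionary returned by `extract_text_from_pdf()`; the helper name `pages_needing_ocr` and the 20-character threshold are assumptions chosen for illustration:

```python
def pages_needing_ocr(text_data, min_chars=20):
    """Return the page keys whose extracted text is shorter than min_chars."""
    return [page for page, text in text_data.items() if len(text.strip()) < min_chars]

# Hypothetical extraction result: page 2 came back empty, as a scanned page would
sample = {"Page_1": "Kaameshwar Rai, Bengaluru, KA - India", "Page_2": ""}
print(pages_needing_ocr(sample))  # ['Page_2']
```

The flagged pages could then be rendered to images and passed to Tesseract, while all other pages keep the faster PyMuPDF text.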



# Embed Text & Create Vector Store

In [11]:
import re

def clean_and_chunk_text(text_dict, chunk_size=400, overlap=100):
    """
    Cleans and splits extracted PDF text into overlapping chunks.
    
    Args:
        text_dict (dict): Output from extract_text_from_pdf().
        chunk_size (int): Number of characters in each chunk.
        overlap (int): Number of overlapping characters between chunks.
    
    Returns:
        List[str]: Cleaned and chunked text segments.
    """
    # A step of zero makes range() fail and a negative step yields no chunks
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    all_chunks = []

    for content in text_dict.values():
        # Step 1: Collapse every run of whitespace (newlines included) into one space
        cleaned = re.sub(r'\s+', ' ', content).strip()

        # Step 2: Split into overlapping chunks; the window advances by chunk_size - overlap
        for i in range(0, len(cleaned), chunk_size - overlap):
            chunk = cleaned[i:i + chunk_size]
            if len(chunk.strip()) > 100:  # skip fragments of 100 characters or fewer
                all_chunks.append(chunk.strip())

    return all_chunks


In [29]:
resume_chunks = clean_and_chunk_text(resume_text)
print(f" Total chunks: {len(resume_chunks)}")
print("\n Sample Chunk:\n", resume_chunks[0])

 Total chunks: 9

 Sample Chunk:
 Kaameshwar Rai Bengaluru, KA - India | kaameshwarrai@gmail.com | +91-9910713109 | LinkedIn | Github | Kaggle PROFILE SUMMARY ● Analytical individual with excellent communication, programming expertise, and strong aptitude. Proven leadership abilities demonstrated by effectively leading project groups during undergraduate and diploma courses, resulting in the successful completion of projects. KEY
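
With the defaults (`chunk_size=400`, `overlap=100`) the window advances 300 characters at a time. The function above returns plain strings, so the page each chunk came from is lost, which matters for citation-level answers. A minimal sketch of a variant that keeps the page key and start offset with every chunk; the name `chunk_with_pages` and the record layout are hypothetical, and unlike the original it does not drop short fragments:

```python
import re

def chunk_with_pages(text_dict, chunk_size=400, overlap=100):
    """Split each page into overlapping chunks and remember the page of origin."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    records = []
    for page, content in text_dict.items():
        # Collapse whitespace the same way clean_and_chunk_text() does
        cleaned = re.sub(r"\s+", " ", content).strip()
        for start in range(0, len(cleaned), step):
            chunk = cleaned[start:start + chunk_size]
            records.append({"page": page, "start": start, "text": chunk})
    return records

# Toy input: one 1000-character page gives window starts 0, 300, 600, 900
demo = chunk_with_pages({"Page_1": "x" * 1000})
print([r["start"] for r in demo])  # [0, 300, 600, 900]
```

The `page` value could later be copied into each document's metadata so that an answer can point to a page rather than only to a chunk number.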


# Storing these chunks in a vector database (ChromaDB) using OpenAI or free local embeddings.

In [15]:
!pip install langchain-community langchain chromadb openai tiktoken




In [17]:
# Install free embeddings + Chroma
!pip install -U sentence-transformers chromadb langchain-community


Collecting sentence-transformers
  Downloading sentence_transformers-4.1.0-py3-none-any.whl (345 kB)
     -------------------------------------- 345.7/345.7 kB 1.3 MB/s eta 0:00:00
Collecting transformers<5.0.0,>=4.41.0
  Downloading transformers-4.52.1-py3-none-any.whl (10.5 MB)
     ---------------------------------------- 10.5/10.5 MB 2.1 MB/s eta 0:00:00
Collecting safetensors>=0.4.3
  Downloading safetensors-0.5.3-cp38-abi3-win_amd64.whl (308 kB)
     -------------------------------------- 308.9/308.9 kB 2.1 MB/s eta 0:00:00
Installing collected packages: safetensors, transformers, sentence-transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.24.0
    Uninstalling transformers-4.24.0:
      Successfully uninstalled transformers-4.24.0
Successfully installed safetensors-0.5.3 sentence-transformers-4.1.0 transformers-4.52.1


In [19]:

!pip uninstall -y torch torchvision torchaudio sentence-transformers transformers

Found existing installation: torch 1.12.1
Uninstalling torch-1.12.1:
  Successfully uninstalled torch-1.12.1
Found existing installation: sentence-transformers 4.1.0
Uninstalling sentence-transformers-4.1.0:
  Successfully uninstalled sentence-transformers-4.1.0
Found existing installation: transformers 4.52.1
Uninstalling transformers-4.52.1:
  Successfully uninstalled transformers-4.52.1




In [20]:
# Installing required packages fresh
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install sentence-transformers transformers langchain-community chromadb langchain-huggingface


Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting torch
  Obtaining dependency information for torch from https://download.pytorch.org/whl/cpu/torch-2.7.0%2Bcpu-cp310-cp310-win_amd64.whl.metadata
  Downloading https://download.pytorch.org/whl/cpu/torch-2.7.0%2Bcpu-cp310-cp310-win_amd64.whl.metadata (29 kB)
Collecting torchvision
  Obtaining dependency information for torchvision from https://download.pytorch.org/whl/cpu/torchvision-0.22.0%2Bcpu-cp310-cp310-win_amd64.whl.metadata
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.22.0%2Bcpu-cp310-cp310-win_amd64.whl.metadata (6.3 kB)
Collecting torchaudio
  Obtaining dependency information for torchaudio from https://download.pytorch.org/whl/cpu/torchaudio-2.7.0%2Bcpu-cp310-cp310-win_amd64.whl.metadata
  Downloading https://download.pytorch.org/whl/cpu/torchaudio-2.7.0%2Bcpu-cp310-cp310-win_amd64.whl.metadata (6.7 kB)
Collecting sympy>=1.13.3
  Obtaining dependency information for sympy>=1.13.3 from https

In [24]:
!pip install sentence-transformers transformers langchain-huggingface langchain-community chromadb


Collecting sentence-transformers
  Using cached sentence_transformers-4.1.0-py3-none-any.whl (345 kB)
Collecting transformers
  Using cached transformers-4.52.1-py3-none-any.whl (10.5 MB)
Installing collected packages: transformers, sentence-transformers
Successfully installed sentence-transformers-4.1.0 transformers-4.52.1


In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

# Local embedding model: downloaded on first use, then runs without an API key
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Convert chunks to documents
documents = [
    Document(page_content=chunk, metadata={"source": f"Resume_Chunk_{i+1}"})
    for i, chunk in enumerate(resume_chunks)
]

# Store vectors in ChromaDB; with chromadb 0.4+ the data is written to
# persist_directory automatically, so no explicit persist() call is needed
vector_db = Chroma.from_documents(
    documents,
    embedding=embedding,
    persist_directory="resume_chroma_local"
)

print("Successfully stored using local embeddings.")

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Reconnect to saved vector DB (use the same embedding model it was built with)
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = Chroma(
    persist_directory="resume_chroma_local",
    embedding_function=embedding
)
# Return the 3 most similar chunks for each query
retriever = vector_db.as_retriever(search_kwargs={"k": 3})
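
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors lie closest to it. A toy sketch of that ranking step using cosine similarity on hand-made 3-dimensional vectors; these are not real embeddings, and the distance metric and index that Chroma actually uses may differ from this illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, indexed, k=3):
    """Return the k (source, score) pairs most similar to the query vector."""
    scored = [(source, cosine_similarity(query_vec, vec)) for source, vec in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up vectors standing in for embedded chunks
toy_index = {
    "Resume_Chunk_1": [1.0, 0.0, 0.0],
    "Resume_Chunk_2": [0.7, 0.7, 0.0],
    "Resume_Chunk_3": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], toy_index, k=2))
```

Here the query vector points almost in the same direction as `Resume_Chunk_1`, so that chunk ranks first, followed by `Resume_Chunk_2`.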

In [None]:

from transformers import pipeline

# Load a local Q&A model (no API needed)
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

In [None]:
def ask_resume_question(query: str):
    # Step 1: Get top matching documents
    docs = retriever.invoke(query)
    
    # Step 2: Merge content into context
    context = "\n\n".join(doc.page_content for doc in docs)

    # Step 3: Run the extractive QA pipeline on the merged context
    answer = qa_pipeline(question=query, context=context)

    # Step 4: Return result with citations
    # (lists every retrieved chunk, not only the one containing the answer)
    cited_sources = [doc.metadata['source'] for doc in docs]
    return {
        "question": query,
        "answer": answer['answer'],
        "confidence": answer['score'],
        "citations": cited_sources
    }
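
The returned dictionary can be turned into a user-facing reply. Because the QA model extracts a span from the context even when the context does not really answer the question, low scores are worth flagging. A minimal sketch, assuming a hypothetical helper `format_answer` and an arbitrary cut-off of 0.3 that would need tuning on real queries:

```python
def format_answer(result, min_confidence=0.3):
    """Render a QA result as text, flagging answers below min_confidence."""
    citations = ", ".join(result["citations"]) or "none"
    score = result["confidence"]
    if score < min_confidence:
        return f"No confident answer found (score {score:.2f}). Checked: {citations}"
    return f"{result['answer']} (score {score:.2f}; sources: {citations})"

# Hypothetical result in the shape returned by ask_resume_question();
# the values are invented for illustration, not actual model output
example = {
    "question": "Where is the candidate based?",
    "answer": "Bengaluru",
    "confidence": 0.82,
    "citations": ["Resume_Chunk_1", "Resume_Chunk_2"],
}
print(format_answer(example))
```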

# Submitted by Kaameshwar Rai