# Document Chatbot 
This is a minimal document-based chatbot built on embeddings and similarity search, to be further enhanced with better chunking and sentence-level extraction.

---
## 0. Setup

In [5]:
# Import the necessary libraries
import os
import re  # Regular expressions (text cleaning and sentence splitting)
import numpy as np
from pathlib import Path  # File path handling
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer  # Sentence embeddings
import pdfplumber  # PDF text extraction
import docx  # DOCX reading (python-docx)

DATA_DIR = Path('../data')  # Folder containing the documents to be accessed
DATA_DIR.resolve()



WindowsPath('D:/Users/Swapnil_IDS_GENAI/GENAI/Assignments/projects/Domain_specific_chatbot/data')

The documents in the `data` folder will be used to answer questions; the resolved path is shown above.

## 1. Load a document
We'll auto-pick a document from `data/`. You can change the file name here.

In [6]:
# Listing all relevant document files in the data directory
files = [p for p in DATA_DIR.iterdir() if p.suffix.lower() in {'.pdf', '.docx', '.txt'}]
files


[WindowsPath('../data/Introduction-to-Machine-Learning.pdf'),
 WindowsPath('../data/machine_learning.docx'),
 WindowsPath('../data/machine_learning_tutotrial.txt')]

In [7]:
# Select file (update if you want a different one)
doc_path = files[0]
doc_path


WindowsPath('../data/Introduction-to-Machine-Learning.pdf')

We select the first document for testing.
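
To choose a document by name rather than by list position, something like the following could work. This is only a sketch: `pick_document` is a hypothetical helper that is not part of the notebook, and `example_files` is a stand-in for the `files` list shown above.

```python
from pathlib import PurePosixPath

def pick_document(paths, name):
    """Return the first path whose file name matches `name` (case-insensitive)."""
    for p in paths:
        if p.name.lower() == name.lower():
            return p
    raise FileNotFoundError(f"{name!r} not found among {[p.name for p in paths]}")

# Stand-in list mirroring the files listed above; in the notebook, pass `files` instead.
example_files = [
    PurePosixPath('../data/Introduction-to-Machine-Learning.pdf'),
    PurePosixPath('../data/machine_learning.docx'),
    PurePosixPath('../data/machine_learning_tutotrial.txt'),
]
chosen = pick_document(example_files, 'machine_learning.docx')
```

Selecting by name avoids depending on the order in which the directory listing returns files.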

In [8]:
# Functions to read different file types

# Function to read PDF files
def read_pdf(path):
    text = ''
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + '\n'
    return text

# Function to read docx files
def read_docx(path):
    doc = docx.Document(path)
    return '\n'.join(p.text for p in doc.paragraphs)

# Function to read txt files
def read_txt(path):
    return Path(path).read_text(encoding='utf-8', errors='ignore')

# General function to load file based on its extension
def load_file(path):
    path = Path(path)
    if path.suffix.lower() == '.pdf':
        return read_pdf(path)
    if path.suffix.lower() == '.docx':
        return read_docx(path)
    if path.suffix.lower() == '.txt':
        return read_txt(path)
    return ''

text = load_file(doc_path)
len(text)  # Total length of the document in characters


132343

## 2. Chunk the document
Chunking is the step where we split a long document into smaller, meaningful pieces before we embed and search it.
We chunk the document (break its content into groups of words) by:

1. Splitting by paragraph first (keeping context intact)
2. Splitting long paragraphs into sentences
3. Adding overlap (the last 2 sentences) to avoid cutting ideas in half

This step is crucial because better chunking leads to better retrieval.
We'll start with a basic sentence-based chunker and then improve it.
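
As a small illustration of sentence splitting and overlap, the toy example below uses the same regular expression as the chunkers in this notebook. The sample text and the choice to end the first chunk after three sentences are assumptions made only for this sketch.

```python
import re

sample = ("Chunks hold a few sentences. Each chunk has a word budget! "
          "What happens at the limit? The tail is reused.")

# Split after '.', '!' or '?' when followed by one or more spaces
sentences = re.split(r'(?<=[.!?]) +', sample)

overlap_sentences = 2
# Assume the word budget is reached after three sentences
first_chunk = sentences[:3]
# The next chunk starts with the last two sentences of the previous chunk
next_chunk = first_chunk[-overlap_sentences:] + sentences[3:]
```

Here `next_chunk` repeats the last two sentences of `first_chunk`, so an idea that straddles the boundary appears whole in at least one chunk.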

In [9]:
# Basic text chunking function based on sentence boundaries and word count

def chunk_text_basic(text, max_words=300, overlap=50):
    sentences = re.split(r'(?<=[.!?]) +', text)# Split text into sentences
    chunks = []# List to hold the text chunks
    chunk = []# Current chunk being built
    word_count = 0# Current word count in the chunk

    # Iterate through sentences and build chunks
    for s in sentences:
        words = s.split()# Split sentence into words
        chunk.append(s)# Add sentence to current chunk
        word_count += len(words)# Update word count

        # If the chunk exceeds max_words, finalize it and start a new chunk
        if word_count >= max_words:
            chunks.append(' '.join(chunk))# Add the current chunk to the list of chunks
            # Carry the last `overlap // 10` sentences (at least 1; 5 with the default) into the next chunk
            chunk = chunk[-max(1, overlap // 10):]
            # Recount words so the size check stays accurate for the next iteration
            word_count = sum(len(s.split()) for s in chunk)

    if chunk:
        chunks.append(' '.join(chunk))# Add any remaining sentences as the last chunk
    return chunks
# Chunk the loaded text
chunks = chunk_text_basic(text)
len(chunks), chunks[0][:400]# Show number of chunks and preview of the first chunk


(205,
 '1\n1\n2\n1\nPart 1: Introduction to Machine Learning 6\nChapter 1: What is Machine Learning? 6\nDefinition and History of Machine Learning 6\nKey Concepts in Machine Learning 6\nReal-World Applications of Machine Learning 7\nChapter 2: Types of Machine Learning 8\nSupervised Learning 9\nUnsupervised Learning 9\nReinforcement Learning 10\nChapter 3: Applications of Machine Learning 10\nImage and Speech Recogniti')

The document starts with a table of contents (TOC), so the first chunks are not useful for answering questions: they list headings, not explanations. We can skip chunks that look like a TOC so that the chatbot only embeds real content.
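
One possible way to skip TOC-like chunks is a simple heuristic on the share of bare-number tokens, since TOC entries end in page numbers. This is a sketch, not something the notebook implements: `looks_like_toc` and its `min_number_ratio` threshold are hypothetical, and the threshold value is a guess.

```python
def looks_like_toc(chunk, min_number_ratio=0.1):
    """Heuristic: flag a chunk when many of its tokens are bare numbers (e.g. page numbers)."""
    tokens = chunk.split()
    if not tokens:
        return True  # Nothing worth embedding in an empty chunk
    number_ratio = sum(t.isdigit() for t in tokens) / len(tokens)
    return number_ratio >= min_number_ratio

# Toy inputs: one TOC-like string and one ordinary sentence
toc_like = ('Part 1: Introduction to Machine Learning 6 Chapter 1: What is Machine Learning? 6 '
            'Definition and History of Machine Learning 6 Key Concepts in Machine Learning 6')
prose = ('Machine learning is a branch of artificial intelligence that focuses on '
         'creating algorithms that can learn from data.')

# Keep only chunks that do not look like a table of contents
content_chunks = [c for c in [toc_like, prose] if not looks_like_toc(c)]
```

In the notebook, a filter like this would run on the chunk list before embedding; the threshold should be checked against the actual chunks, because prose that is dense with numbers could be flagged by mistake.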
 

### Improved chunking (paragraph-first + sentence split)
This keeps meaning intact and avoids cutting ideas in half.

In [10]:
# Improved text chunking function with paragraph handling and sentence overlap
def chunk_text_better(text, max_words=250, overlap_sentences=2):
    # Split text into non-empty lines (each treated as a paragraph) and strip whitespace
    paragraphs = [p.strip() for p in text.split('\n') if p.strip()]
    chunks = []
    cur = []
    cur_words = 0

    # Helper function to flush the current chunk to the list of chunks
    def flush_chunk():
        if cur:
            chunks.append(' '.join(cur))

    # Build chunks paragraph by paragraph; long paragraphs are split into sentences
    for p in paragraphs:
        words = p.split()# Split paragraph into words
        if len(words) > max_words:
            # split long paragraph into sentences
            sentences = re.split(r'(?<=[.!?]) +', p)
            for s in sentences:
                s_words = s.split()
                # If adding this sentence would exceed max_words, flush and keep the overlap
                if cur_words + len(s_words) > max_words:
                    flush_chunk()
                    cur[:] = cur[-overlap_sentences:]  # Carry over the last items as overlap
                    cur_words = sum(len(x.split()) for x in cur)
                cur.append(s)
                cur_words += len(s_words)
        else:
            if cur_words + len(words) > max_words:
                flush_chunk()
                cur[:] = cur[-overlap_sentences:]
                cur_words = sum(len(x.split()) for x in cur)
            cur.append(p)
            cur_words += len(words)

    flush_chunk()
    return chunks
# Chunk the loaded text using the improved method
better_chunks = chunk_text_better(text)
len(better_chunks), better_chunks[0][:400]


(89,
 '1 1 2 1 Part 1: Introduction to Machine Learning 6 Chapter 1: What is Machine Learning? 6 Definition and History of Machine Learning 6 Key Concepts in Machine Learning 6 Real-World Applications of Machine Learning 7 Chapter 2: Types of Machine Learning 8 Supervised Learning 9 Unsupervised Learning 9 Reinforcement Learning 10 Chapter 3: Applications of Machine Learning 10 Image and Speech Recogniti')

## 3. Embed chunks
Let's use a small, fast model to generate embeddings: `sentence-transformers/all-MiniLM-L6-v2`.

In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Encode chunks into embeddings
chunk_embeddings = model.encode(better_chunks, show_progress_bar=False, convert_to_numpy=True)
# convert_to_numpy=True returns the embeddings as a NumPy array (convenient for math such as cosine similarity).
# chunk_embeddings is a 2D NumPy array of shape (num_chunks, embedding_dim).
# Each row is the vector for one chunk, which later lets us compare questions to chunks.
chunk_embeddings.shape  # (num_chunks, embedding_dim)

(89, 384)

## 4. Retrieve top-k chunks
A document may have hundreds of chunks, and we don't want to scan or answer from all of them.
So we embed the question and pick the k most similar chunks.
Keeping k > 1 lets us fetch the chunk most similar to the question along with chunks that carry supporting details.
This makes top-k retrieval the core of keeping the chatbot precise and fast.
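
To make the ranking step concrete, here is a toy version of cosine similarity and top-k selection in plain Python, so it runs without the notebook's dependencies. The two-dimensional vectors are invented for illustration, and `cosine` and `top_k_indices` are hypothetical names; the notebook itself uses `cosine_similarity` from scikit-learn and `np.argsort`.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(query_vec, chunk_vecs, k=2):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scores = [cosine(query_vec, v) for v in chunk_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Made-up 2D "embeddings" for three chunks and one question
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.2]
top = top_k_indices(toy_query, toy_chunks, k=2)
```

With these vectors, chunk 0 points in nearly the same direction as the query and ranks first, chunk 2 shares part of that direction and ranks second, and chunk 1 is left out.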


In [12]:
# Function to retrieve top-k relevant chunks for a given question
def retrieve_chunks(question, chunks, embeddings, top_k=3):
    # Encode the question into an embedding
    q_emb = model.encode([question], convert_to_numpy=True)
    # Calculate cosine similarity scores between question and chunk embeddings
    scores = cosine_similarity(q_emb, embeddings)[0]
    # Get indices of top-k highest scoring chunks
    top_idx = np.argsort(scores)[-top_k:][::-1]
    # Return list of tuples (index, score, chunk)
    return [(i, scores[i], chunks[i]) for i in top_idx]

question = 'What is machine learning?'  # Example question

# Retrieve the top relevant chunks
results = retrieve_chunks(question, better_chunks, chunk_embeddings)
# Show indices and scores of the top relevant chunks
[(i, round(score, 3)) for i, score, _ in results]


[(2, 0.757), (78, 0.685), (17, 0.659)]

The three chunks most similar to the question have been retrieved, with similarity scores between 0.66 and 0.76.

## 5. Extract most relevant sentences
Instead of returning entire chunks, this step splits the top chunks into sentences, embeds each sentence, and picks the 2–3 sentences most similar to the question.
This helps return the exact relevant information.

In [13]:
# Function to extract best sentences from top chunks based on question relevance
def extract_best_sentences(question, chunks, top_k=3, max_sentences=3):
    # gather candidate sentences
    candidates = []
    for _, _, chunk in chunks:
        # Each item in chunks is a tuple like (index, score, chunk_text).
        # We only care about the chunk text, so _ ignores the index and score.
        # Split into sentences and filter out very short ones (25 characters or fewer).
        sentences = [s.strip() for s in re.split(r'(?<=[.!?]) +', chunk) if len(s.strip()) > 25]
        candidates.extend(sentences)

    if not candidates:
        return 'No relevant sentences found.'

    # Embed the candidate sentences and the question
    sent_emb = model.encode(candidates, show_progress_bar=False, convert_to_numpy=True)
    q_emb = model.encode([question], convert_to_numpy=True)
    # Cosine similarity between the question and each candidate sentence
    scores = cosine_similarity(q_emb, sent_emb)[0]
    # Indices of the top-scoring sentences, ordered from most to least relevant
    top_idx = np.argsort(scores)[-max_sentences:][::-1]
    return ' '.join([candidates[i] for i in top_idx])

# Extract best sentences for the example question
answer = extract_best_sentences(question, results)
answer


'Machine learning is a branch of artificial intelligence that focuses on creating algorithms that can learn from data and make predictions or decisions based on that data. Machine learning algorithms are used to discover patterns and insights in data, automate decision-making processes, and create intelligent systems that can learn and adapt to new information. Definition and History of Machine Learning Machine learning has its roots in the field of statistics and has evolved over time with contributions from various fields such as computer science, mathematics, and engineering.'

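### Inference

The sentences most relevant to the question are selected from the PDF to form the answer. Note that the third sentence begins with a section heading ("Definition and History of Machine Learning"): headings have no end punctuation, so they merge with the sentence that follows.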

## 6. Wrapping into a tiny chatbot class
This uses the functions defined earlier to set up a mini chatbot with the improved logic.

In [14]:
# Mini document chatbot class encapsulating ingestion and Q&A functionality
class MiniDocChatbot:

    # Initialize with a sentence transformer model
    def __init__(self, model):
        self.model = model # Store the model
        self.chunks = [] # List to hold text chunks
        self.embeddings = None


    # Ingest text by chunking and embedding
    def ingest(self, text):
        self.chunks = chunk_text_better(text)# Chunk the text(our document)
        self.embeddings = self.model.encode(self.chunks, show_progress_bar=False, convert_to_numpy=True)
        # Embed the chunks


    # Answer the questions based on ingested document
    def answer(self, question, top_k=3):
        if self.embeddings is None:
            return 'No document loaded.'
        # Get top relevant chunks
        results = retrieve_chunks(question, self.chunks, self.embeddings, top_k=top_k)
        return extract_best_sentences(question, results, max_sentences=3)

bot = MiniDocChatbot(model)# create chatbot instance
bot.ingest(text)# ingest the document text
bot.answer('Explain supervised learning in simple terms')# Get answer to a sample question


'● Supervised learning involves training a model using labeled data. In supervised learning, the algorithm is trained on labeled data, where the target variable is known. Supervised learning trains a machine learning model on labeled data, where inputs and outputs are known.'

The chatbot returns the three sentences it scores as most relevant to the question. The leading `●` is a bullet character carried over from the extracted document text.