## Homework Week 3
Tasks:
1. Design a Retriever class
• Methods: add_documents(), query(), save(), load()
• Let it handle chunking + indexing internally
• Use FAISS and SentenceTransformers under the hood
2. Feed Real Documents
• Try .txt, .md, or .pdf files (if ambitious)
• Preprocess and chunk them (include a chunking function)
3. Write Tests
• Given a document and a query, does your retriever return the expected chunk?
4. Document Your Code
• Add docstrings, usage examples, and push to GitHub

Deliverables:
1. A retriever.py module with a reusable class
2. One or more loaded document sources
3. Working local search using queries
4. Committed README update and usage instructions
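
A minimal sketch of what the test in Task 3 could look like. To keep it runnable without downloading a model, it uses a hypothetical `ToyRetriever` that ranks chunks by word overlap and only mimics the `query()` signature; it is not semantic search. In the real test, the stand-in would be replaced by the `Retriever` class indexing a small fixture file.

```python
from typing import List


class ToyRetriever:
    """Hypothetical stand-in with the same query() shape as the homework Retriever."""

    def __init__(self, chunks: List[str]):
        self.documents = chunks

    def query(self, query_text: str, top_k: int = 3) -> List[str]:
        query_words = set(query_text.lower().split())
        # Rank chunks by the number of words they share with the query.
        ranked = sorted(
            self.documents,
            key=lambda chunk: len(query_words & set(chunk.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


def test_query_returns_expected_chunk():
    chunks = [
        "the wolf met the girl in the forest",
        "the grandmother lived half an hour from the village",
        "the hunter opened the door",
    ]
    retriever = ToyRetriever(chunks)
    results = retriever.query("where did the grandmother live", top_k=1)
    # The chunk about the grandmother should be ranked first.
    assert results == ["the grandmother lived half an hour from the village"]


test_query_returns_expected_chunk()
```

The same structure (known chunks, a query, an assertion on the top result) should carry over to a pytest file once the real class is used.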

In [8]:
%pip install sentence-transformers
%pip install faiss-cpu
%pip install PyMuPDF

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
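
The `huggingface/tokenizers` warning in the output above suggests setting `TOKENIZERS_PARALLELISM` explicitly. One way to do that from inside the notebook is sketched below; choosing `"false"` is an assumption here, and setting it in an early cell, before the tokenizer is first used, is the cautious option.

```python
import os

# Set the variable explicitly, as the warning recommends.
# "false" disables tokenizer parallelism; "true" is the other accepted value.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```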



In [None]:
import os
import pickle
from pathlib import Path
import fitz  # PyMuPDF
import faiss
from sentence_transformers import SentenceTransformer
from typing import List
from transformers import AutoTokenizer


In [None]:
class Retriever:
    """
    Retriever class for document indexing and semantic search using FAISS and SentenceTransformers.
    """
    def __init__(self, model_name: str = "all-MiniLM-L6-v2", chunk_size: int = 200, chunk_overlap: int = 30):
        """Initialize the embedding model and chunk parameters so they can be reused across methods."""
        self.model = SentenceTransformer(model_name)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.documents: List[str] = []
        self.index = None
    
    def load_document(self, file_path: str) -> str:
        """
        Load a document from a file path and return its text content as a string.

        Supports .txt, .md, and .pdf formats. PDFs are read with PyMuPDF;
        .txt and .md files are read directly as UTF-8.
        Raises ValueError for unsupported file formats.
        """
        suffix = Path(file_path).suffix.lower()
        if suffix in (".txt", ".md"):
            return Path(file_path).read_text(encoding="utf-8")
        elif suffix == ".pdf":
            # Use a context manager so the PDF file handle is closed afterwards.
            with fitz.open(file_path) as doc:
                return "\n".join(page.get_text() for page in doc)
        else:
            raise ValueError(f"Unsupported file format: {suffix}")
    

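    def chunk_text(self, text: str) -> List[str]:
        """Split text into overlapping chunks of at most chunk_size tokens."""
        step = self.chunk_size - self.chunk_overlap
        if step <= 0:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        tokens = tokenizer.tokenize(text)
        chunks = []
        for i in range(0, len(tokens), step):
            chunk = tokens[i:i + self.chunk_size]
            # Merge WordPiece fragments ("##...") back into readable words
            # instead of joining the raw tokens with spaces.
            chunks.append(tokenizer.convert_tokens_to_string(chunk))
        return chunks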
    
    def add_documents(self, file_paths: List[str]):
        """
        Load, chunk, embed, and index the given files.

        Each call rebuilds the index from scratch and replaces any
        previously added documents.
        """
        all_chunks = []
        for path in file_paths:
            raw_text = self.load_document(path)
            chunks = self.chunk_text(raw_text)
            all_chunks.extend(chunks)
        if not all_chunks:
            raise ValueError("No text found in the given files.")

        self.documents = all_chunks

        # Create embeddings and build FAISS index
        embeddings = self.model.encode(all_chunks, show_progress_bar=True)
        dim = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dim)
        self.index.add(embeddings)

    def query(self, query_text: str, top_k: int = 3) -> List[str]:
        """Return the top_k chunks closest to query_text, nearest first."""
        if self.index is None:
            raise ValueError("Index not initialized. Add documents first.")
        query_emb = self.model.encode([query_text])
        distances, indices = self.index.search(query_emb, top_k)
        # FAISS pads the result with -1 when fewer than top_k vectors are indexed; skip those.
        return [self.documents[i] for i in indices[0] if i >= 0]
    
    def save(self, folder: str):
        """Save the documents list and FAISS index to disk."""
        if self.index is None:
            raise ValueError("Index not initialized. Add documents first.")
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, "documents.pkl"), "wb") as f:
            pickle.dump(self.documents, f)
        faiss.write_index(self.index, os.path.join(folder, "index.faiss"))


    def load(self, folder: str):
        """
        Load documents list and FAISS index from disk.

        Args:
            folder (str): Directory path where files are saved.
        """
        with open(os.path.join(folder, "documents.pkl"), "rb") as f:
            self.documents = pickle.load(f)
        self.index = faiss.read_index(os.path.join(folder, "index.faiss"))
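
To see how `chunk_size` and `chunk_overlap` interact, here is a small standalone sketch of the sliding window that `chunk_text` applies. The helper name `chunk_tokens` and the toy token list are assumptions for illustration; it works on any list of strings and needs no tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each window starts `step` tokens after the previous one.
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]


tokens = [f"t{i}" for i in range(10)]
windows = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(window)
```

With ten tokens, a window of four, and an overlap of one, the windows start at offsets 0, 3, 6, and 9. The last window holds only `t9`, which the previous window already covers, so it may be worth dropping such short tails before indexing.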


## Test

In [11]:
retriever = Retriever(chunk_size=100, chunk_overlap=20)

In [12]:
retriever.add_documents(["Rotkäppchen.pdf"])

Batches: 100%|██████████| 1/1 [00:01<00:00,  1.28s/it]


In [14]:
query = "Ist Rotkäppchen ein Mädchen?"  # German: "Is Little Red Riding Hood a girl?"
results = retriever.query(query, top_k=3)

for i, chunk in enumerate(results):
    print(f"Result {i+1}:\n{chunk}\n")


Result 1:
Rotkäppchen Es war einmal ein kleines süßes Mädchen, das hatte jedermann lieb, der sie nur ansah, am allerliebsten aber ihre Großmutter, die wusste gar nicht, was sie alles dem Kinde geben sollte. Einmal schenkte sie ihm ein Käppchen von rotem Samt, und weil ihm das so wohl stand, und es nichts anders mehr tragen wollte, hieß es nur das Rotkäppchen. Eines Tages sprach seine Mutter zu ihm: "Komm, Rotkäppchen, da hast du ein Stück Kuchen und eine Flasche Wein, bring das der Großmutter hinaus; sie ist krank und schwach und wird sich daran laben. Mach dich auf, bevor es heiß

Result 2:
gab ihr die Hand darauf. Die Großmutter aber wohnte draußen im Wald, eine halbe Stunde vom Dorf. Wie nun Rotkäppchen in den Wald kam, begegnete ihm der Wolf. Rotkäppchen aber wusste nicht, was das für ein böses Tier war, und fürchtete sich nicht vor ihm. "Guten Tag, Rotkäppchen!" sprach er. "Schönen Dank, Wolf!" - "Wo hinaus so früh, Rotkäppchen?" - "Zur Großmutter." - "Was trägst du unter der Schü