## Homework Week 3
Tasks:
1. Design a Retriever class
• Methods: add_documents(), query(), save(), load()
• Let it handle chunking + indexing internally
• Use FAISS and SentenceTransformers under the hood
2. Feed Real Documents
• Try .txt, .md, or .pdf files (if ambitious)
• Preprocess and chunk them (include a chunking function)
3. Write Tests
• Given a document and a query, does your retriever return the expected chunk?
4. Document Your Code
• Add docstrings, usage examples, and push to GitHub

Deliverables:
1. A retriever.py module with a reusable class
2. One or more loaded document sources
3. Working local search using queries
4. Committed README update and usage instructions
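
Task 3 can be sketched as a pytest-style test. To keep the sketch runnable without a model download, it uses a hypothetical stand-in class, `KeywordRetriever`, that ranks chunks by word overlap. It is not the SentenceTransformers/FAISS retriever and only mirrors the `query()` signature. The example sentences are made up for illustration. The same assertion pattern should carry over to the real `Retriever`.

```python
import re
from typing import List


class KeywordRetriever:
    """Hypothetical stand-in for the real Retriever (an assumption for this sketch).

    Ranks chunks by how many query words they share, so the test
    runs offline. Only the query() signature matches the assignment.
    """

    def __init__(self, chunks: List[str]):
        self.documents = chunks

    @staticmethod
    def _words(text: str) -> set:
        return set(re.findall(r"\w+", text.lower()))

    def query(self, query_text: str, top_k: int = 3) -> List[str]:
        query_words = self._words(query_text)
        # sorted() is stable, so ties keep their original order.
        ranked = sorted(
            self.documents,
            key=lambda chunk: len(query_words & self._words(chunk)),
            reverse=True,
        )
        return ranked[:top_k]


def test_query_returns_expected_chunk():
    chunks = [
        "The wolf waited behind the old oak tree.",
        "Grandmother lived in a small house in the forest.",
        "The girl wore a red velvet cap.",
    ]
    retriever = KeywordRetriever(chunks)
    results = retriever.query("Where did the grandmother live?", top_k=1)
    assert results == ["Grandmother lived in a small house in the forest."]
```

With the real class, the fixture would be a small text file passed to `add_documents()`, and the assertion would check that the expected passage appears among the top results rather than requiring an exact match at rank 1.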

In [None]:
%pip install sentence-transformers
%pip install faiss-cpu
%pip install PyMuPDF

In [3]:
import os
import pickle
from pathlib import Path
import fitz  # PyMuPDF
import faiss
from sentence_transformers import SentenceTransformer
from typing import List
from transformers import AutoTokenizer


In [4]:
class Retriever:
    """
    Retriever class for document indexing and semantic search using FAISS and SentenceTransformers.
    """
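    def __init__(self, model_name: str = "all-MiniLM-L6-v2", chunk_size: int = 200, chunk_overlap: int = 30):
        """
        Initialize the retriever with an embedding model and chunking parameters.

        Args:
            model_name (str): SentenceTransformers model used for embeddings.
            chunk_size (int): Number of tokens per chunk.
            chunk_overlap (int): Number of tokens shared by consecutive chunks.
        """
        # A non-positive step (chunk_size - chunk_overlap) would break chunking.
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
        self.model = SentenceTransformer(model_name)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.documents: List[str] = []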
        self.index = None

    def load_document(self, file_path: str) -> str:
        """
        Load a document from a file path and return its text content.

        Supports .txt, .md, and .pdf files. PDF text is extracted with PyMuPDF;
        .txt and .md files are read directly as UTF-8.

        Raises:
            ValueError: If the file format is not supported.
        """
        suffix = Path(file_path).suffix.lower()
        if suffix in (".txt", ".md"):
            return Path(file_path).read_text(encoding="utf-8")
        elif suffix == ".pdf":
            # Close the PDF after extraction instead of leaving the handle open.
            with fitz.open(file_path) as doc:
                return "\n".join(page.get_text() for page in doc)
        else:
            raise ValueError(f"Unsupported file format: {suffix}")
    

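    def chunk_text(self, text: str) -> List[str]:
        """
        Split text into overlapping chunks of chunk_size tokens.

        Note: tokenizer.tokenize() returns lowercased, accent-stripped WordPiece
        tokens, so the joined chunks contain "##" fragments, not the original text.
        """
        # The tokenizer is reloaded on every call; caching it would be faster.
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        tokens = tokenizer.tokenize(text)
        chunks = []
        # Consecutive chunks start `step` tokens apart and share chunk_overlap tokens.
        step = self.chunk_size - self.chunk_overlap
        for i in range(0, len(tokens), step):
            chunk = tokens[i:i + self.chunk_size]
            chunks.append(" ".join(chunk))
        return chunks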
    
    def add_documents(self, file_paths: List[str]):
        """
        Load, chunk, embed, and index the given files.

        New chunks are appended to any documents that are already indexed.

        Args:
            file_paths (List[str]): Paths of the files to add.
        """
        new_chunks = []
        for path in file_paths:
            raw_text = self.load_document(path)
            new_chunks.extend(self.chunk_text(raw_text))
        if not new_chunks:
            return  # nothing to embed

        # Create embeddings and add them to the FAISS index (created on first use)
        embeddings = self.model.encode(new_chunks, show_progress_bar=True)
        if self.index is None:
            self.index = faiss.IndexFlatL2(embeddings.shape[1])
        self.index.add(embeddings)
        # Keep the chunk list aligned with the rows of the index.
        self.documents.extend(new_chunks)

    def query(self, query_text: str, top_k: int = 3) -> List[str]:
        """
        Return the top_k chunks most similar to the query, nearest first.

        Raises:
            ValueError: If no documents have been indexed yet.
        """
        if self.index is None:
            raise ValueError("Index not initialized. Add documents first.")
        query_emb = self.model.encode([query_text])
        _, indices = self.index.search(query_emb, top_k)
        # FAISS pads the result with -1 when fewer than top_k vectors are indexed.
        return [self.documents[i] for i in indices[0] if i != -1]
    
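    def save(self, folder: str):
        """
        Save the documents list and FAISS index to disk.

        Args:
            folder (str): Directory to write to; created if it does not exist.
        """
        if self.index is None:
            raise ValueError("Index not initialized. Add documents first.")
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, "documents.pkl"), "wb") as f:
            pickle.dump(self.documents, f)
        faiss.write_index(self.index, os.path.join(folder, "index.faiss"))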


    def load(self, folder: str):
        """
        Load the documents list and FAISS index from disk.

        Only load folders you created yourself: unpickling untrusted files
        can execute arbitrary code.

        Args:
            folder (str): Directory path where files are saved.
        """
        with open(os.path.join(folder, "documents.pkl"), "rb") as f:
            self.documents = pickle.load(f)
        self.index = faiss.read_index(os.path.join(folder, "index.faiss"))
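
Conceptually, `IndexFlatL2` stores the vectors unchanged, and `search` compares the query against every one of them, returning squared L2 distances in ascending order. The pure-Python sketch below illustrates that idea only. The helper name `flat_l2_search` is hypothetical, and FAISS itself does this in optimized native code on float32 arrays and batches of queries.

```python
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    top_k: int,
) -> Tuple[List[float], List[int]]:
    """Exhaustive nearest-neighbour search by squared L2 distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query.
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    best = scored[:top_k]
    return [d for d, _ in best], [i for _, i in best]


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, [1.0, 1.0], top_k=2)
print(indices)    # [1, 0]
print(distances)  # [1.0, 2.0]
```

Because every stored vector is compared, results are exact, and the cost grows linearly with the number of chunks, which is fine for a handful of documents.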


## Test

In [5]:
retriever = Retriever(chunk_size=100, chunk_overlap=20)

In [7]:
retriever.add_documents(["docs/Rotkaeppchen.pdf"])

Token indices sequence length is longer than the specified maximum sequence length for this model (2770 > 512). Running this sequence through the model will result in indexing errors
Batches: 100%|██████████| 2/2 [00:01<00:00,  1.25it/s]


In [8]:
# German for "Is Little Red Riding Hood a girl?"; the indexed PDF is in German.
query = "Ist Rotkäppchen ein Mädchen?"
results = retriever.query(query, top_k=3)

for i, chunk in enumerate(results):
    print(f"Result {i+1}:\n{chunk}\n")


Result 1:
er ##tra ##nk . rot ##ka ##pp ##chen abe ##r gin ##g fr ##oh ##lich na ##ch ha ##us , und von nun an ta ##t i ##hm ni ##eman ##d me ##hr et ##was zu ##lei ##de . * * * 2 www . grimm ##stor ##ies . com

Result 2:
" rot ##ka ##pp ##chen , si ##eh ein ##mal die sc ##hone ##n blu ##men , die rings ##um ##her ste ##hen . war ##um gu ##cks ##t du di ##ch nic ##ht um ? ich g ##lau ##be , du horst ga ##r nic ##ht , wi ##e die vo ##gle ##in so lie ##bl ##ich sing ##en ? du ge ##hs ##t ja fur di ##ch hi ##n , als wen ##n du zur sc ##hul ##e gin ##gs ##t , und ist so lust ##ig ha ##uss ##en in dem wal ##d . " rot ##ka

Result 3:
mach ##en , " sa ##gt ##e rot ##ka ##pp ##chen zur mutter , und ga ##b i ##hr die hand dar ##au ##f . die gr ##o ##ß ##mut ##ter abe ##r wo ##hn ##te dr ##au ##ße ##n im wal ##d , eine hal ##be stu ##nde vo ##m do ##rf . wi ##e nun rot ##ka ##pp ##chen in den wal ##d kam , beg ##eg ##net ##e i ##hm der wolf . rot ##ka ##pp ##chen abe ##r wu ##ss ##te nic ##ht , 
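
The `##` fragments and missing umlauts in these results come from `chunk_text`: `bert-base-uncased` lowercases the text and strips accents, and joining the WordPiece tokens with spaces does not restore the original words. One possible fix is to track character offsets and slice the original string. The sketch below is an assumption, not the notebook's method: it counts whitespace-delimited words instead of WordPiece tokens, and the function name `chunk_by_words` is hypothetical. The sample sentence is reconstructed from Result 1 above.

```python
import re
from typing import List


def chunk_by_words(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> List[str]:
    """Split text into overlapping chunks that are slices of the original string."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    # Character span (start, end) of every whitespace-delimited word.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(spans), step):
        window = spans[i:i + chunk_size]
        # Slice the source text so casing, umlauts, and punctuation survive.
        chunks.append(text[window[0][0]:window[-1][1]])
        if i + chunk_size >= len(spans):
            break  # this window already reached the end of the text
    return chunks


sample = (
    "Rotkäppchen aber ging fröhlich nach Haus, "
    "und von nun an tat ihm niemand mehr etwas zuleide."
)
for chunk in chunk_by_words(sample, chunk_size=6, chunk_overlap=2):
    print(chunk)
```

To keep counting in model tokens, Hugging Face fast tokenizers can return character offsets when called with `return_offsets_mapping=True`, and the same slicing approach should apply to those offsets.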

In [9]:
# conda install langchain-community llama-cpp-python

In [10]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

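# Download the model from https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/blob/main/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf
llm = LlamaCpp(
    model_path="model/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf",
    n_gpu_layers=0,  # run entirely on the CPU
    n_batch=512,  # prompt tokens processed per batch
    n_ctx=2048,  # context window in tokens (prompt plus answer)
    f16_kv=True,  # keep the key/value cache in half precision
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),  # stream tokens to stdout
    verbose=True,
)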

llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from model/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32768,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,

In [11]:
# Join all retrieved chunks so the prompt still works when top_k changes.
context = "\n".join(results)
prompt = f"""
Answer the following question using only the context provided. Context:
{context}
Question:
{query}
"""
llm.invoke(prompt)

Answer:
 Ja, Rotkäppchen ist ein Mädchen.


llama_print_timings:        load time =   80026.81 ms
llama_print_timings:      sample time =       8.39 ms /    18 runs   (    0.47 ms per token,  2144.90 tokens per second)
llama_print_timings: prompt eval time =   80025.15 ms /   412 tokens (  194.24 ms per token,     5.15 tokens per second)
llama_print_timings:        eval time =   11453.41 ms /    17 runs   (  673.73 ms per token,     1.48 tokens per second)
llama_print_timings:       total time =   91583.95 ms


'Answer:\n Ja, Rotkäppchen ist ein Mädchen.'
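
The prompt assembly can also be wrapped in a small helper that numbers the chunks and stops before the context grows too large for the model's window. This is a sketch under assumptions: the name `build_prompt` and the parameter `max_context_chars` are hypothetical, and the default of 4000 characters is an arbitrary placeholder rather than a value derived from `n_ctx`. A character count is only a rough proxy for tokens.

```python
from typing import List


def build_prompt(question: str, chunks: List[str], max_context_chars: int = 4000) -> str:
    """Assemble a context-restricted prompt from retrieved chunks."""
    context_parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] {chunk.strip()}"
        # Stop adding chunks once the rough character budget is exhausted.
        if used + len(entry) > max_context_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the following question using only the context provided.\n"
        f"Context:\n{context}\n"
        f"Question:\n{question}\n"
    )


print(build_prompt("Who lives in the forest?", ["The grandmother lived in the forest."]))
```

Numbering the chunks makes it easier to ask the model which passage supports its answer, and the budget keeps a large `top_k` from overflowing the context window.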