# Data Ingestion - Step-by-Step Notebook

This notebook converts `src/document_ingestion/data_ingestion.py` into a procedural, debuggable format.

## What This File Contains
Four major components for document handling:

| Class | Purpose |
|-------|--------|
| `FaissManager` | Manages FAISS vector index (load/create/add documents) |
| `ChatIngestor` | Ingests documents for chat - splits, embeds, builds retriever |
| `DocHandler` | Saves and reads PDFs for document analysis |
| `DocumentComparator` | Saves, reads, and combines PDFs for comparison |

## Prerequisites
- API keys set: `GOOGLE_API_KEY` (for embeddings)
- Config file at `config/config.yaml`
- PyMuPDF (`fitz`) installed: `pip install pymupdf`
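
The prerequisites above can be checked up front before running any cell. The following is a minimal sketch, not part of the original module; the helper name `missing_prerequisites` is hypothetical, and it only tests for presence (it does not validate the key or parse the config).

```python
import importlib.util
import os
from pathlib import Path
from typing import List


def missing_prerequisites(config_path: str = "config/config.yaml") -> List[str]:
    """Return a list of unmet prerequisites (empty if everything is in place)."""
    missing = []
    if not os.environ.get("GOOGLE_API_KEY"):
        missing.append("GOOGLE_API_KEY environment variable is not set")
    if not Path(config_path).is_file():
        missing.append(f"config file not found: {config_path}")
    # PyMuPDF is imported under the module name `fitz`
    if importlib.util.find_spec("fitz") is None:
        missing.append("PyMuPDF is not installed (pip install pymupdf)")
    return missing


for problem in missing_prerequisites():
    print(f"MISSING: {problem}")
```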

---
## Cell 1: Configuration Placeholders

**Purpose:** Define all configurable paths and parameters.

**External Dependencies:**
- `DATA_STORAGE_PATH` env var (optional; overrides `ANALYSIS_DIR` for DocHandler in Cell 13)
- Directories will be created automatically if they don't exist

In [None]:
# ============================================================
# CONFIGURATION PLACEHOLDERS - MODIFY THESE BEFORE RUNNING
# ============================================================

# Base directories
TEMP_BASE = "data"                          # Where uploaded files are temporarily stored
FAISS_BASE = "faiss_index"                  # Where FAISS indexes are stored
ANALYSIS_DIR = "data/document_analysis"     # For DocHandler
COMPARE_DIR = "data/document_compare"       # For DocumentComparator

# Session handling
USE_SESSION_DIRS = True   # Create session-specific subdirectories
SESSION_ID = None         # Auto-generate if None, or set specific ID like "test_session_001"

# Chunking parameters (for ChatIngestor)
CHUNK_SIZE = 1000         # Characters per chunk
CHUNK_OVERLAP = 200       # Overlap between chunks
RETRIEVER_K = 5           # Number of documents to retrieve

# Supported file types
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}

print("Configuration loaded:")
print(f"  TEMP_BASE: {TEMP_BASE}")
print(f"  FAISS_BASE: {FAISS_BASE}")
print(f"  CHUNK_SIZE: {CHUNK_SIZE}, OVERLAP: {CHUNK_OVERLAP}")
print(f"  RETRIEVER_K: {RETRIEVER_K}")

---
## Cell 2: Imports

**Purpose:** Import all required libraries and modules.

**Key Dependencies:**
- `fitz` (PyMuPDF): For reading PDF files
- `langchain`: Document schemas, text splitters, FAISS vectorstore
- `utils.model_loader.ModelLoader`: Loads embedding model
- `utils.file_io`: Session ID generation, file saving utilities

In [None]:
from __future__ import annotations
import os
import sys
import json
import uuid
import hashlib
import shutil
from pathlib import Path
from typing import Iterable, List, Optional, Dict, Any

# PDF handling
import fitz  # PyMuPDF

# LangChain imports
from langchain_core.documents import Document  # same class; avoids the deprecated langchain.schema path
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Project imports
from utils.model_loader import ModelLoader
from logger import GLOBAL_LOGGER as log
from exception.custom_exception import DocumentPortalException
from utils.file_io import generate_session_id, save_uploaded_files
from utils.document_ops import load_documents, concat_for_analysis, concat_for_comparison

print("All imports successful!")

---
## Cell 3: Initialize Model Loader & Embeddings

**Purpose:** Load the embedding model that will be used to convert text into vectors.

**What happens:**
- `ModelLoader` reads `config/config.yaml`
- Loads Google Generative AI Embeddings (or configured alternative)

In [None]:
# Initialize model loader and embeddings
try:
    model_loader = ModelLoader()
    embeddings = model_loader.load_embeddings()
    
    log.info("Embeddings loaded successfully")
    print(f"Embeddings loaded: {type(embeddings).__name__}")
    
except Exception as e:
    print(f"ERROR loading embeddings: {e}")
    raise DocumentPortalException("Failed to load embeddings", sys) from e

---
## Cell 4: Generate Session ID

**Purpose:** Create or use a session identifier for organizing files.

Sessions help:
- Isolate different users/conversations
- Prevent file conflicts
- Enable easy cleanup of old data
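
The notebook relies on `generate_session_id()` from `utils.file_io`, whose implementation is not shown here. As an illustration only, the sketch below shows one way such an ID could be built; the function name `make_session_id` and the timestamp-plus-random-suffix format are assumptions, not the project's actual scheme.

```python
import uuid
from datetime import datetime, timezone


def make_session_id(prefix: str = "session") -> str:
    """Hypothetical stand-in for generate_session_id(): UTC timestamp + random suffix."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{stamp}_{uuid.uuid4().hex[:8]}"


print(make_session_id())
```

Putting the timestamp first means IDs sort chronologically by name, which matters for any cleanup that orders session directories alphabetically.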

In [None]:
# Generate or use provided session ID
session_id = SESSION_ID or generate_session_id()

print(f"Session ID: {session_id}")
print("This will be used to organize files in subdirectories.")

---
# PART 1: FAISS Manager Functions

The FAISS Manager handles:
- Creating new FAISS indexes
- Loading existing indexes from disk
- Adding documents idempotently (no duplicates)
- Tracking ingested documents via metadata

---
## Cell 5: FAISS Helper - Fingerprint Function

**Purpose:** Generate a unique identifier for each document chunk to prevent duplicates.

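Uses either:
- Source file path + row ID, when the metadata has a `source` or `file_path` (chunks that share a source and carry no `row_id` get the same fingerprint)
- SHA256 hash of content (fallback when no source is present)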
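
To make the two branches concrete, the sketch below re-implements the rule under the name `fingerprint_sketch` and shows what happens when two chunks come from the same source without a `row_id`. The second function, `content_aware_fingerprint`, is a hypothetical variant (not used by the notebook) that mixes in a content hash so such chunks stay distinct.

```python
import hashlib
from typing import Any, Dict


def fingerprint_sketch(text: str, metadata: Dict[str, Any]) -> str:
    """Source-based key when a source is known, otherwise a content hash."""
    src = metadata.get("source") or metadata.get("file_path")
    rid = metadata.get("row_id")
    if src is not None:
        return f"{src}::{'' if rid is None else rid}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def content_aware_fingerprint(text: str, metadata: Dict[str, Any]) -> str:
    """Hypothetical variant: append a short content hash to the base key."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{fingerprint_sketch(text, metadata)}::{digest}"


meta = {"source": "doc.pdf"}  # assumption: chunk metadata carries no row_id
same_key = fingerprint_sketch("chunk A", meta) == fingerprint_sketch("chunk B", meta)
still_same = content_aware_fingerprint("chunk A", meta) == content_aware_fingerprint("chunk B", meta)
print(f"source-only keys collide: {same_key}")
print(f"content-aware keys collide: {still_same}")
```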

In [None]:
def fingerprint(text: str, metadata: Dict[str, Any]) -> str:
    """
    Generate unique fingerprint for a document chunk.
    Used to detect and skip duplicates during ingestion.
    
    Args:
        text: Document content
        metadata: Document metadata (may contain source, file_path, row_id)
        
    Returns:
        str: Unique identifier for this chunk
    """
    src = metadata.get("source") or metadata.get("file_path")
    rid = metadata.get("row_id")
    if src is not None:
        return f"{src}::{'' if rid is None else rid}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Test fingerprint
print("Testing fingerprint function:")
print(f"  With source: {fingerprint('content', {'source': 'doc.pdf', 'row_id': 1})}")
print(f"  Without source: {fingerprint('unique content', {})[:20]}...")

---
## Cell 6: FAISS Manager - Setup Variables

**Purpose:** Initialize variables for managing the FAISS index.

**Key components:**
- `faiss_index_dir`: Where FAISS files are stored
- `meta_path`: JSON file tracking ingested documents
- `vectorstore`: The FAISS vectorstore instance (`None` until an index is loaded or created)
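
The tracking file is plain JSON of the form `{"rows": {fingerprint: true}}`. The sketch below shows a defensive way to read it; the helper name `load_meta` is hypothetical, and the extra shape check goes slightly beyond what the notebook cell does.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_meta(path: Path) -> Dict[str, Any]:
    """Read the tracking file, falling back to an empty record on any problem."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return {"rows": {}}
    # Guard against valid JSON that does not have the expected shape
    if not isinstance(data, dict) or not isinstance(data.get("rows"), dict):
        return {"rows": {}}
    return data


with tempfile.TemporaryDirectory() as tmp:
    meta_file = Path(tmp) / "ingested_meta.json"
    print(load_meta(meta_file))  # file does not exist yet
    meta_file.write_text(json.dumps({"rows": {"doc.pdf::0": True}}), encoding="utf-8")
    print(load_meta(meta_file))  # file now holds one tracked fingerprint
```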

In [None]:
# FAISS Manager setup
faiss_index_dir = Path(FAISS_BASE) / session_id if USE_SESSION_DIRS else Path(FAISS_BASE)
faiss_index_dir.mkdir(parents=True, exist_ok=True)

meta_path = faiss_index_dir / "ingested_meta.json"
faiss_meta: Dict[str, Any] = {"rows": {}}

# Load existing metadata if present
if meta_path.exists():
    try:
        faiss_meta = json.loads(meta_path.read_text(encoding="utf-8")) or {"rows": {}}
        print(f"Loaded existing metadata: {len(faiss_meta['rows'])} documents tracked")
    except Exception:
        faiss_meta = {"rows": {}}
        print("Could not load metadata, starting fresh")
else:
    print("No existing metadata found, starting fresh")

# Vectorstore placeholder
vectorstore: Optional[FAISS] = None

print(f"\nFAISS index directory: {faiss_index_dir}")
print(f"Metadata path: {meta_path}")

---
## Cell 7: FAISS Helper Functions

**Purpose:** Define helper functions for FAISS operations.

Functions:
- `faiss_exists()`: Check if index files exist on disk
- `save_meta()`: Persist metadata to JSON file

In [None]:
def faiss_exists() -> bool:
    """Check if FAISS index files exist on disk."""
    return (faiss_index_dir / "index.faiss").exists() and (faiss_index_dir / "index.pkl").exists()

def save_meta():
    """Save metadata to JSON file."""
    meta_path.write_text(json.dumps(faiss_meta, ensure_ascii=False, indent=2), encoding="utf-8")

print(f"FAISS index exists: {faiss_exists()}")

---
## Cell 8: FAISS - Load or Create Index

**Purpose:** Load existing FAISS index from disk, or create a new one.

**Logic:**
1. If index exists on disk → load it
2. If no index and texts provided → create new index
3. If no index and no texts → raise error
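
The three-way decision can be isolated from FAISS entirely. The sketch below is a generic version under stated assumptions: `load` and `create` are hypothetical callbacks standing in for the real load and build steps, and the file names `index.faiss` and `index.pkl` are the ones the notebook checks for.

```python
from pathlib import Path
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def load_or_create(index_dir: Path,
                   texts: Optional[List[str]],
                   load: Callable[[Path], T],
                   create: Callable[[List[str]], T]) -> T:
    """Load when both index files exist, build when texts are given, else fail."""
    files_present = all((index_dir / name).exists() for name in ("index.faiss", "index.pkl"))
    if files_present:
        return load(index_dir)   # 1. index on disk: load it
    if texts:
        return create(texts)     # 2. no index, texts provided: build a new one
    raise ValueError("No existing index and no texts to create one")  # 3. neither


result = load_or_create(
    Path("no_such_dir"),
    ["hello"],
    load=lambda d: "loaded",
    create=lambda t: f"created:{len(t)}",
)
print(result)
```

Note that when the index already exists, the `texts` argument is ignored by this branch; new content has to be added in a separate step.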

In [None]:
def load_or_create_faiss(texts: Optional[List[str]] = None, 
                          metadatas: Optional[List[dict]] = None) -> FAISS:
    """
    Load existing FAISS index or create new one from texts.
    
    Args:
        texts: List of text strings to index (required if creating new)
        metadatas: Optional metadata for each text
        
    Returns:
        FAISS: The vectorstore instance
    """
    global vectorstore
    
    # Try to load existing index
    if faiss_exists():
        vectorstore = FAISS.load_local(
            str(faiss_index_dir),
            embeddings=embeddings,
            allow_dangerous_deserialization=True,
        )
        log.info("FAISS index loaded from disk", path=str(faiss_index_dir))
        print(f"Loaded existing FAISS index from: {faiss_index_dir}")
        return vectorstore
    
    # Create new index if texts provided
    if not texts:
        raise DocumentPortalException("No existing FAISS index and no data to create one", sys)
    
    vectorstore = FAISS.from_texts(texts=texts, embedding=embeddings, metadatas=metadatas or [])
    vectorstore.save_local(str(faiss_index_dir))
    log.info("New FAISS index created", path=str(faiss_index_dir), docs=len(texts))
    print(f"Created new FAISS index at: {faiss_index_dir} ({len(texts)} documents)")
    return vectorstore

print("load_or_create_faiss function defined.")

---
## Cell 9: FAISS - Add Documents Idempotently

**Purpose:** Add new documents to FAISS index, skipping duplicates.

**Idempotent behavior:**
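- Uses fingerprints to detect already-ingested documents
- Only adds documents whose fingerprint is not yet recorded in `faiss_meta["rows"]`
- Records new fingerprints, then saves the index and metadata to disk
- Note: a freshly created index has no tracking entries, so texts passed to `load_or_create_faiss()` are embedded again by the first `add_documents_to_faiss()` call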
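
Stripped of the vector store, the idempotent pattern is a keyed filter over a persistent "seen" record. The sketch below illustrates it with plain tuples; `filter_new` is a hypothetical helper, not a function from the module.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def filter_new(items: Iterable[Tuple[Hashable, str]], seen: Dict[Hashable, bool]) -> List[str]:
    """Return payloads whose key is not in `seen`, recording each new key."""
    fresh = []
    for key, payload in items:
        if key in seen:
            continue  # already ingested: skip
        seen[key] = True
        fresh.append(payload)
    return fresh


seen: Dict[Hashable, bool] = {}
batch = [("a.txt::0", "alpha"), ("b.txt::0", "beta")]
print(filter_new(batch, seen))  # first call: both items are new
print(filter_new(batch, seen))  # second call: nothing left to add
```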

In [None]:
def add_documents_to_faiss(docs: List[Document]) -> int:
    """
    Add documents to FAISS index, skipping duplicates.
    
    Args:
        docs: List of LangChain Document objects
        
    Returns:
        int: Number of new documents actually added
    """
    global faiss_meta
    
    if vectorstore is None:
        raise RuntimeError("Call load_or_create_faiss() before add_documents_to_faiss()")
    
    new_docs: List[Document] = []
    
    for d in docs:
        key = fingerprint(d.page_content, d.metadata or {})
        if key in faiss_meta["rows"]:
            continue  # Skip duplicate
        faiss_meta["rows"][key] = True
        new_docs.append(d)
    
    if new_docs:
        vectorstore.add_documents(new_docs)
        vectorstore.save_local(str(faiss_index_dir))
        save_meta()
        log.info("Documents added to FAISS", added=len(new_docs), skipped=len(docs)-len(new_docs))
    
    return len(new_docs)

print("add_documents_to_faiss function defined.")

---
# PART 2: Chat Ingestor Functions

The Chat Ingestor handles:
- Saving uploaded files
- Loading documents (PDF, DOCX, TXT)
- Splitting into chunks
- Building a retriever for RAG

---
## Cell 10: Chat Ingestor - Setup Directories

**Purpose:** Set up directory structure for document ingestion.

In [None]:
# Chat Ingestor directory setup
temp_base = Path(TEMP_BASE)
temp_base.mkdir(parents=True, exist_ok=True)

faiss_base = Path(FAISS_BASE)
faiss_base.mkdir(parents=True, exist_ok=True)

# Session-specific directories
if USE_SESSION_DIRS:
    temp_dir = temp_base / session_id
    faiss_dir = faiss_base / session_id
else:
    temp_dir = temp_base
    faiss_dir = faiss_base

temp_dir.mkdir(parents=True, exist_ok=True)
faiss_dir.mkdir(parents=True, exist_ok=True)

print(f"Temp directory: {temp_dir}")
print(f"FAISS directory: {faiss_dir}")

---
## Cell 11: Chat Ingestor - Split Documents

**Purpose:** Split documents into smaller chunks for better retrieval.

**Why chunking matters:**
- LLMs have context limits
- Smaller chunks = more precise retrieval
- Overlap ensures context isn't lost at boundaries
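
To see what `chunk_size` and `chunk_overlap` mean in isolation, the sketch below cuts fixed-width character windows. It is a deliberately naive illustration: `naive_chunks` is a hypothetical helper, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries before falling back to raw character positions.

```python
from typing import List


def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Fixed-width windows; each starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


for chunk in naive_chunks("abcdefghijklmnopqrst", chunk_size=8, chunk_overlap=3):
    print(repr(chunk))
```

Each window repeats the last three characters of the previous one, which is the mechanism that keeps a sentence straddling a boundary visible in both chunks.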

In [None]:
def split_documents(docs: List[Document], 
                    chunk_size: int = CHUNK_SIZE, 
                    chunk_overlap: int = CHUNK_OVERLAP) -> List[Document]:
    """
    Split documents into smaller chunks.
    
    Args:
        docs: List of LangChain Document objects
        chunk_size: Maximum characters per chunk
        chunk_overlap: Overlap between consecutive chunks
        
    Returns:
        List[Document]: Chunked documents
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, 
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(docs)
    log.info("Documents split", 
             input_docs=len(docs), 
             output_chunks=len(chunks), 
             chunk_size=chunk_size, 
             overlap=chunk_overlap)
    return chunks

print(f"split_documents function defined (chunk_size={CHUNK_SIZE}, overlap={CHUNK_OVERLAP})")

---
## Cell 12: Chat Ingestor - Build Retriever

**Purpose:** Complete pipeline to ingest uploaded files and create a retriever.

**Pipeline:**
```
Uploaded Files → Save to Disk → Load as Documents → Split into Chunks
    → Create/Update FAISS Index → Return Retriever
```

In [None]:
def build_retriever(uploaded_files: Iterable,
                    chunk_size: int = CHUNK_SIZE,
                    chunk_overlap: int = CHUNK_OVERLAP,
                    k: int = RETRIEVER_K):
    """
    Build a retriever from uploaded files.
    
    Args:
        uploaded_files: Iterable of file-like objects
        chunk_size: Characters per chunk
        chunk_overlap: Overlap between chunks
        k: Number of documents to retrieve per query
        
    Returns:
        Retriever: FAISS-based retriever ready for RAG
    """
    global vectorstore, faiss_meta
    
    try:
        # Step 1: Save uploaded files
        paths = save_uploaded_files(uploaded_files, temp_dir)
        print(f"Step 1: Saved {len(paths)} files to {temp_dir}")
        
        # Step 2: Load documents
        docs = load_documents(paths)
        if not docs:
            raise ValueError("No valid documents loaded")
        print(f"Step 2: Loaded {len(docs)} documents")
        
        # Step 3: Split into chunks
        chunks = split_documents(docs, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        print(f"Step 3: Split into {len(chunks)} chunks")
        
        # Step 4: Prepare for FAISS
        texts = [c.page_content for c in chunks]
        metas = [c.metadata for c in chunks]
        
        # Step 5: Create/load FAISS index
        # Reset index dir to match current session
        global faiss_index_dir, meta_path
        faiss_index_dir = faiss_dir
        meta_path = faiss_index_dir / "ingested_meta.json"
        
        if meta_path.exists():
            faiss_meta = json.loads(meta_path.read_text(encoding="utf-8")) or {"rows": {}}
        else:
            faiss_meta = {"rows": {}}
        
        vs = load_or_create_faiss(texts=texts, metadatas=metas)
        print(f"Step 5: FAISS index ready at {faiss_index_dir}")

        # Step 6: Add documents (idempotent)
        added = add_documents_to_faiss(chunks)
        print(f"Step 6: Added {added} new documents (skipped duplicates)")

        # Step 7: Return retriever
        retriever = vs.as_retriever(search_type="similarity", search_kwargs={"k": k})
        log.info("Retriever built successfully", k=k, index=str(faiss_dir))
        print(f"Step 7: Retriever ready (k={k})")
        
        return retriever
        
    except Exception as e:
        print(f"ERROR building retriever: {e}")
        log.error("Failed to build retriever", error=str(e))
        raise DocumentPortalException("Failed to build retriever", e) from e

print("build_retriever function defined.")

---
# PART 3: DocHandler Functions

DocHandler is for **document analysis** - simpler than chat ingestion:
- Save a single PDF
- Read it page-by-page
- Return text content for analysis

---
## Cell 13: DocHandler - Setup

**Purpose:** Set up directory for document analysis storage.

In [None]:
# DocHandler setup
doc_handler_data_dir = os.getenv("DATA_STORAGE_PATH", ANALYSIS_DIR)
doc_handler_session_id = session_id
doc_handler_session_path = os.path.join(doc_handler_data_dir, doc_handler_session_id)
os.makedirs(doc_handler_session_path, exist_ok=True)

print(f"DocHandler session path: {doc_handler_session_path}")

---
## Cell 14: DocHandler - Save PDF

**Purpose:** Save an uploaded PDF file to the session directory.
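
When testing outside a web framework there is no real upload object, only the expectation of a `.name` attribute plus `.read()` (or `.getbuffer()`). The sketch below builds an in-memory stand-in; `make_upload` is a hypothetical test helper, and the bytes are a placeholder that is fine for exercising the save step but is not a readable PDF.

```python
import io


def make_upload(name: str, data: bytes) -> io.BytesIO:
    """Build an in-memory stand-in for an uploaded file."""
    buf = io.BytesIO(data)
    buf.name = name  # BytesIO instances accept extra attributes
    return buf


upload = make_upload("report.pdf", b"%PDF-1.4 placeholder bytes")
print(upload.name, len(upload.read()))
```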

In [None]:
def save_pdf_for_analysis(uploaded_file) -> str:
    """
    Save uploaded PDF to session directory for analysis.
    
    Args:
        uploaded_file: File-like object with .name attribute
        
    Returns:
        str: Path where file was saved
    """
    try:
        filename = os.path.basename(uploaded_file.name)
        if not filename.lower().endswith(".pdf"):
            raise ValueError("Invalid file type. Only PDFs are allowed.")
        
        save_path = os.path.join(doc_handler_session_path, filename)
        
        with open(save_path, "wb") as f:
            if hasattr(uploaded_file, "read"):
                f.write(uploaded_file.read())
            else:
                f.write(uploaded_file.getbuffer())
        
        log.info("PDF saved for analysis", file=filename, path=save_path)
        return save_path
        
    except Exception as e:
        print(f"ERROR saving PDF: {e}")
        log.error("Failed to save PDF", error=str(e))
        raise DocumentPortalException(f"Failed to save PDF: {str(e)}", e) from e

print("save_pdf_for_analysis function defined.")

---
## Cell 15: DocHandler - Read PDF

**Purpose:** Read PDF content page-by-page using PyMuPDF.

**Output format:**
```
--- Page 1 ---
[page 1 content]

--- Page 2 ---
[page 2 content]
...
```
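
The layout can be reproduced on plain strings, which is handy for testing downstream code without a PDF at hand. This is a sketch only; `join_pages` is a hypothetical helper that applies the same marker and join pattern to page texts that have already been extracted.

```python
from typing import List


def join_pages(pages: List[str]) -> str:
    """Prefix each page with a numbered marker and join the results with newlines."""
    return "\n".join(f"\n--- Page {n} ---\n{text}" for n, text in enumerate(pages, start=1))


print(join_pages(["first page text", "second page text"]))
```

Because every marker starts with a newline and the parts are joined with another newline, the result begins with a blank line and has a blank line before each subsequent marker.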

In [None]:
def read_pdf_for_analysis(pdf_path: str) -> str:
    """
    Read PDF content page-by-page.
    
    Args:
        pdf_path: Path to the PDF file
        
    Returns:
        str: Full text content with page markers
    """
    try:
        text_chunks = []
        with fitz.open(pdf_path) as doc:
            for page_num in range(doc.page_count):
                page = doc.load_page(page_num)
                text_chunks.append(f"\n--- Page {page_num + 1} ---\n{page.get_text()}")
        
        text = "\n".join(text_chunks)
        log.info("PDF read for analysis", path=pdf_path, pages=len(text_chunks))
        return text
        
    except Exception as e:
        print(f"ERROR reading PDF: {e}")
        log.error("Failed to read PDF", error=str(e), path=pdf_path)
        raise DocumentPortalException(f"Could not process PDF: {pdf_path}", e) from e

print("read_pdf_for_analysis function defined.")

---
# PART 4: Document Comparator Functions

DocumentComparator handles **comparing two PDF documents**:
- Save both PDFs to session directory
- Read each PDF
- Combine content for comparison

---
## Cell 16: DocumentComparator - Setup

**Purpose:** Set up directory for document comparison storage.

In [None]:
# DocumentComparator setup
compare_base_dir = Path(COMPARE_DIR)
compare_session_id = session_id
compare_session_path = compare_base_dir / compare_session_id
compare_session_path.mkdir(parents=True, exist_ok=True)

print(f"DocumentComparator session path: {compare_session_path}")

---
## Cell 17: DocumentComparator - Save Files

**Purpose:** Save two PDF files (reference and actual) for comparison.

In [None]:
def save_comparison_files(reference_file, actual_file) -> tuple:
    """
    Save two PDF files for comparison.
    
    Args:
        reference_file: The original/reference document
        actual_file: The document to compare against reference
        
    Returns:
        tuple: (reference_path, actual_path)
    """
    try:
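        # Use only the final path component so directory parts in the name are ignored
        ref_path = compare_session_path / Path(reference_file.name).name
        act_path = compare_session_path / Path(actual_file.name).name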
        
        for fobj, out in ((reference_file, ref_path), (actual_file, act_path)):
            if not fobj.name.lower().endswith(".pdf"):
                raise ValueError("Only PDF files are allowed.")
            with open(out, "wb") as f:
                if hasattr(fobj, "read"):
                    f.write(fobj.read())
                else:
                    f.write(fobj.getbuffer())
        
        log.info("Comparison files saved", reference=str(ref_path), actual=str(act_path))
        return ref_path, act_path
        
    except Exception as e:
        print(f"ERROR saving comparison files: {e}")
        log.error("Error saving PDF files", error=str(e))
        raise DocumentPortalException("Error saving files", e) from e

print("save_comparison_files function defined.")

---
## Cell 18: DocumentComparator - Read PDF

**Purpose:** Read PDF for comparison (similar to DocHandler, but rejects encrypted PDFs and skips pages with no text).

In [None]:
def read_pdf_for_comparison(pdf_path: Path) -> str:
    """
    Read PDF content for comparison.
    
    Args:
        pdf_path: Path to the PDF file
        
    Returns:
        str: Text content with page markers
    """
    try:
        with fitz.open(pdf_path) as doc:
            if doc.is_encrypted:
                raise ValueError(f"PDF is encrypted: {pdf_path.name}")
            
            parts = []
            for page_num in range(doc.page_count):
                page = doc.load_page(page_num)
                text = page.get_text()
                if text.strip():
                    parts.append(f"\n --- Page {page_num + 1} --- \n{text}")
        
        log.info("PDF read for comparison", file=str(pdf_path), pages=len(parts))
        return "\n".join(parts)
        
    except Exception as e:
        print(f"ERROR reading PDF: {e}")
        log.error("Error reading PDF", file=str(pdf_path), error=str(e))
        raise DocumentPortalException("Error reading PDF", e) from e

print("read_pdf_for_comparison function defined.")

---
## Cell 19: DocumentComparator - Combine Documents

**Purpose:** Combine all PDFs in session directory into a single string for comparison.

In [None]:
def combine_documents_for_comparison() -> str:
    """
    Combine all PDF files in the comparison session directory.
    
    Returns:
        str: Combined text from all PDFs
    """
    try:
        doc_parts = []
        for file in sorted(compare_session_path.iterdir()):
            if file.is_file() and file.suffix.lower() == ".pdf":
                content = read_pdf_for_comparison(file)
                doc_parts.append(f"Document: {file.name}\n{content}")
        
        combined_text = "\n\n".join(doc_parts)
        log.info("Documents combined", count=len(doc_parts))
        return combined_text
        
    except Exception as e:
        print(f"ERROR combining documents: {e}")
        log.error("Error combining documents", error=str(e))
        raise DocumentPortalException("Error combining documents", e) from e

print("combine_documents_for_comparison function defined.")

---
## Cell 20: DocumentComparator - Clean Old Sessions

**Purpose:** Delete old session directories to free up disk space.
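
Which sessions count as "old" depends on how the directories are ordered. As an alternative illustration, the sketch below orders them by modification time rather than by name; `prune_sessions` is a hypothetical helper, and it returns the removed paths so the caller can log them.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def prune_sessions(base_dir: Path, keep_latest: int = 3) -> List[Path]:
    """Remove all but the `keep_latest` most recently modified subdirectories."""
    if keep_latest < 0:
        raise ValueError("keep_latest must be non-negative")
    sessions = sorted(
        (p for p in base_dir.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = sessions[keep_latest:]
    for folder in removed:
        shutil.rmtree(folder, ignore_errors=True)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for name in ("s1", "s2", "s3", "s4"):
        (base / name).mkdir()
    print(len(prune_sessions(base, keep_latest=3)))
```

Sorting by modification time does not depend on the session ID format, at the cost of being affected by anything that touches an old directory.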

In [None]:
def clean_old_sessions(keep_latest: int = 3):
    """
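    Delete old session directories, keeping only the latest N.
    Sessions are ordered by directory name, so "latest" assumes that
    session IDs sort chronologically (e.g. they start with a timestamp).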
    
    Args:
        keep_latest: Number of recent sessions to keep
    """
    try:
        sessions = sorted([f for f in compare_base_dir.iterdir() if f.is_dir()], reverse=True)
        deleted = 0
        for folder in sessions[keep_latest:]:
            shutil.rmtree(folder, ignore_errors=True)
            log.info("Old session deleted", path=str(folder))
            deleted += 1
        print(f"Cleaned {deleted} old sessions (kept latest {keep_latest})")
        
    except Exception as e:
        print(f"ERROR cleaning sessions: {e}")
        log.error("Error cleaning old sessions", error=str(e))
        raise DocumentPortalException("Error cleaning old sessions", e) from e

print("clean_old_sessions function defined.")

---
# PART 5: Testing Cells

---
## Cell 21: Test - Create Sample Document and Test FAISS

**Purpose:** Test FAISS index creation with sample text.

In [None]:
# ============================================================
# TEST: FAISS Index Creation
# ============================================================

# Sample documents
sample_texts = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing enables computers to understand text.",
    "Computer vision allows machines to interpret images."
]

sample_metas = [
    {"source": "ml_intro.txt", "row_id": 0},
    {"source": "dl_intro.txt", "row_id": 0},
    {"source": "nlp_intro.txt", "row_id": 0},
    {"source": "cv_intro.txt", "row_id": 0},
]

print("Testing FAISS index creation...")
print("="*60)

# Create/load FAISS index
vs = load_or_create_faiss(texts=sample_texts, metadatas=sample_metas)

# Create Document objects for adding
sample_docs = [Document(page_content=t, metadata=m) for t, m in zip(sample_texts, sample_metas)]
added = add_documents_to_faiss(sample_docs)
print(f"\nAdded {added} new documents")

# Test search
print("\nTesting similarity search:")
query = "What is machine learning?"
results = vs.similarity_search(query, k=2)
print(f"Query: '{query}'")
for i, doc in enumerate(results, 1):
    print(f"  Result {i}: {doc.page_content[:60]}...")

---
## Cell 22: Test - Read Existing PDF (if available)

**Purpose:** Test PDF reading with an existing file.

In [None]:
# ============================================================
# TEST: PDF Reading
# ============================================================

TEST_PDF_PATH = "data/multi_doc_chat/state_of_the_union.txt"  # <-- CHANGE THIS to a real PDF

if os.path.exists(TEST_PDF_PATH) and TEST_PDF_PATH.lower().endswith('.pdf'):
    print(f"Reading: {TEST_PDF_PATH}")
    print("="*60)
    
    content = read_pdf_for_analysis(TEST_PDF_PATH)
    print(f"Total characters: {len(content)}")
    print("\nFirst 500 characters:")
    print(content[:500])
else:
    print(f"Test PDF not found or not a .pdf file: {TEST_PDF_PATH}")
    print("Update TEST_PDF_PATH to test PDF reading.")

---
## Summary

### Key Variables

| Variable | Type | Description |
|----------|------|-------------|
| `embeddings` | Embeddings | Embedding model for vectorization |
| `vectorstore` | FAISS | The vector index |
| `faiss_meta` | dict | Tracks ingested documents |
| `session_id` | str | Current session identifier |

### Functions by Category

**FAISS Manager:**
| Function | Purpose |
|----------|--------|
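| `fingerprint(text, metadata)` | Generate a chunk fingerprint for duplicate detection |
| `faiss_exists()` | Check whether index files exist on disk |
| `save_meta()` | Persist tracking metadata to JSON |
| `load_or_create_faiss(texts, metadatas)` | Load/create FAISS index |
| `add_documents_to_faiss(docs)` | Add docs idempotently |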

**Chat Ingestor:**
| Function | Purpose |
|----------|--------|
| `split_documents(docs)` | Chunk documents |
| `build_retriever(files)` | Full ingestion pipeline |

**DocHandler:**
| Function | Purpose |
|----------|--------|
| `save_pdf_for_analysis(file)` | Save uploaded PDF |
| `read_pdf_for_analysis(path)` | Read PDF content |

**DocumentComparator:**
| Function | Purpose |
|----------|--------|
| `save_comparison_files(ref, act)` | Save two PDFs |
| `read_pdf_for_comparison(path)` | Read PDF for comparison |
| `combine_documents_for_comparison()` | Combine all PDFs |
| `clean_old_sessions(keep)` | Delete old sessions |