# Enterprise RAG System - Ingestion Demo

This notebook demonstrates the document ingestion pipeline:
1. PDF parsing with structure extraction
2. Hybrid chunking (single canonical pass)
3. Table extraction (tables are never vectorized)
4. Multi-store population (FAISS, Neo4j, SQLite, BM25)

All stores use the SAME chunks with the SAME chunk_ids.

In [None]:
# Add project root to path
import sys
sys.path.insert(0, '..')

# Initialize settings
from config.settings import settings
settings.initialize()

## 1. PDF Parsing

The parser extracts:
- Document structure (SEC Items, Parts, Sections)
- Text content with page numbers
- Tables as separate elements

In [None]:
from src.document.parser import PDFParser
from pathlib import Path

# Initialize parser
parser = PDFParser()

# Parse a sample PDF (replace with your file)
# pdf_path = Path('../data/uploads/sample_10k.pdf')
# document = parser.parse(pdf_path)

# For this demo, just confirm the parser is ready and list what it does
print("Parser initialized")
print("Detects SEC Items, Parts, and section structure")
print("Tracks page numbers for citations")
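To make the structure detection concrete, here is a minimal sketch of how SEC Item headings could be located together with their page numbers. It is not the `PDFParser` implementation: the regular expression, the `find_sec_items` helper, and the sample page texts are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: matches headings such as "Item 1A. Risk Factors".
ITEM_HEADING = re.compile(r"^\s*Item\s+(\d{1,2}[A-C]?)\.\s*(.*)$", re.IGNORECASE)


def find_sec_items(pages):
    """Return (item_number, title, page_number) for each heading found.

    `pages` is a list of page texts; page numbers are 1-based so they
    can be used directly in citations.
    """
    found = []
    for page_number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            match = ITEM_HEADING.match(line)
            if match:
                item_number = match.group(1).upper()
                title = match.group(2).strip()
                found.append((item_number, title, page_number))
    return found


# Made-up page texts standing in for parsed PDF pages.
sample_pages = [
    "PART I\nItem 1. Business\nWe design and sell widgets.",
    "Item 1A. Risk Factors\nOur business is subject to risks.",
]
sec_items = find_sec_items(sample_pages)
print(sec_items)
```

A real filing would need more care (headings split across lines, table-of-contents entries that repeat the same Item titles), so treat this only as an outline of the idea.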

## 2. Hybrid Chunking

Single canonical chunking pass that produces chunks used by ALL stores.

Strategy:
1. Structural boundaries (SEC Items, Parts)
2. Semantic boundaries (paragraphs, concepts)
3. Token limits with overlap
4. Table placeholders (tables never embedded)

In [None]:
from src.document.chunker import HybridChunker

chunker = HybridChunker(
    max_tokens=512,
    min_tokens=50,
    overlap_tokens=64
)

print("Chunker configuration:")
print(f"  Max tokens: {chunker.max_tokens}")
print(f"  Min tokens: {chunker.min_tokens}")
print(f"  Overlap: {chunker.overlap_tokens}")
print(f"  Table placeholder format: {chunker.table_placeholder_format}")
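The token-limit step of the strategy can be sketched as a sliding window. This is a simplified illustration rather than the `HybridChunker` code: the `chunk_with_overlap` function is hypothetical, it works on an already tokenized list, and it ignores structural and semantic boundaries as well as `min_tokens`.

```python
def chunk_with_overlap(tokens, max_tokens=512, overlap_tokens=64):
    """Split `tokens` into windows of at most `max_tokens` tokens.

    Each window starts `max_tokens - overlap_tokens` tokens after the
    previous one, so consecutive windows share `overlap_tokens` tokens.
    """
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    step = max_tokens - overlap_tokens
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


# Placeholder tokens; a real pipeline would use a model tokenizer.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens)
print([len(chunk) for chunk in chunks])
```

With the defaults, each window advances by 448 tokens, so the last 64 tokens of one chunk reappear at the start of the next.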

## 3. Table Extraction

Tables are NEVER vectorized. They are:
1. Extracted to structured JSON
2. Stored in SQLite
3. Linked to explanatory chunks via table_id

In [None]:
from src.document.table_extractor import TableExtractor

extractor = TableExtractor()

print("Table extractor features:")
print("  - Column type inference (text, number, currency, percentage)")
print("  - Schema generation for SQL")
print("  - Markdown formatting for LLM")
print("  - Link to context chunks")
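As a rough illustration of column type inference and SQL schema generation, the sketch below classifies string cells with regular expressions and maps the result to SQLite column types. The helper names, the patterns, and the type mapping are assumptions, not the `TableExtractor` API.

```python
import re

# A number may use thousands separators, a decimal part, a leading minus,
# or accounting-style parentheses for negatives, e.g. "(250)".
NUMBER = r"\(?-?[\d,]+(\.\d+)?\)?"
PATTERNS = [
    ("currency", re.compile(rf"^\$\s*{NUMBER}$")),
    ("percentage", re.compile(rf"^{NUMBER}\s*%$")),
    ("number", re.compile(rf"^{NUMBER}$")),
]
# Assumed mapping from inferred type to SQLite column type.
SQL_TYPES = {"currency": "REAL", "percentage": "REAL", "number": "REAL", "text": "TEXT"}


def infer_column_type(values):
    """Return 'currency', 'percentage', 'number', or 'text' for a column."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return "text"
    for type_name, pattern in PATTERNS:
        if all(pattern.match(v) for v in cleaned):
            return type_name
    return "text"


def build_create_table(table_name, columns):
    """Build a CREATE TABLE statement from {column_name: [cell, ...]}."""
    column_defs = ", ".join(
        f'"{name}" {SQL_TYPES[infer_column_type(values)]}'
        for name, values in columns.items()
    )
    return f'CREATE TABLE "{table_name}" ({column_defs})'


# Made-up table contents for illustration.
columns = {
    "segment": ["Cloud", "Devices"],
    "revenue": ["$1,200", "$850"],
    "growth": ["12.5%", "3%"],
}
schema_sql = build_create_table("segment_revenue", columns)
print(schema_sql)
```

A column is only given a numeric type when every non-empty cell matches, so a single stray footnote marker makes the whole column fall back to `text`.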

## 4. Full Ingestion Pipeline

Orchestrates all steps:
1. Parse → Chunk → Extract Tables
2. Populate Vector Store (FAISS)
3. Populate Knowledge Graph (Neo4j)
4. Populate Table Store (SQLite)
5. Build BM25 Index

In [None]:
from src.pipeline.ingestion import IngestionPipeline

pipeline = IngestionPipeline()

print("Ingestion pipeline initialized")
print(f"  Vector store: {type(pipeline.vector_store).__name__}")
print(f"  BM25 index: {type(pipeline.bm25_index).__name__}")
print(f"  Table store: {type(pipeline.table_store).__name__}")
print(f"  Knowledge graph: {type(pipeline.knowledge_graph).__name__}")

In [None]:
# Example ingestion (uncomment with real file)
# result = pipeline.ingest('../data/uploads/sample_10k.pdf')
# 
# print(f"Document ID: {result.doc_id}")
# print(f"Title: {result.title}")
# print(f"Pages: {result.page_count}")
# print(f"Chunks created: {result.chunks_created}")
# print(f"Tables extracted: {result.tables_extracted}")
# print(f"Entities extracted: {result.entities_extracted}")
# print(f"Processing time: {result.processing_time_seconds:.2f}s")

## 5. Verify Store Contents

In [None]:
# Check store counts (run an ingestion above first; otherwise these
# reflect whatever the stores already hold)
print("Store statistics:")
print(f"  Vector store: {pipeline.vector_store.count} vectors")
print(f"  BM25 index: {pipeline.bm25_index.count} chunks")
print(f"  Table store: {pipeline.table_store.count} tables")
print(f"  Knowledge graph: {pipeline.knowledge_graph.node_count} nodes")
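Counts alone do not prove that the stores hold the SAME chunk_ids. One way to check is to compare the ID sets directly, as in the sketch below. How each store exposes its IDs is not shown in this notebook, so the input is a plain dictionary of sets and `find_missing_chunk_ids` is a hypothetical helper. The table store is left out because it holds tables rather than chunks.

```python
def find_missing_chunk_ids(store_ids):
    """Map each store name to the chunk_ids it lacks.

    `store_ids` maps a store name to the set of chunk_ids it contains.
    A store is reported only if it is missing at least one ID that some
    other store has.
    """
    all_ids = set().union(*store_ids.values())
    missing = {}
    for name, ids in store_ids.items():
        gap = all_ids - ids
        if gap:
            missing[name] = sorted(gap)
    return missing


# Made-up ID sets; in practice these would be read from each store.
store_ids = {
    "vector_store": {"c1", "c2", "c3"},
    "bm25_index": {"c1", "c2", "c3"},
    "knowledge_graph": {"c1", "c3"},
}
missing = find_missing_chunk_ids(store_ids)
print(missing)
```

An empty result means every store references the same set of chunks.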

## Key Design Decisions

### Why Single Canonical Chunking?
- Ensures consistency across all stores
- Every store references the SAME chunk_id
- Enables cross-store deduplication
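One way to get a stable chunk_id is to derive it from the chunk itself, so that re-running ingestion on the same document yields the same IDs. The scheme below is an assumption for illustration and may differ from what the pipeline does; `make_chunk_id` and the ID layout are hypothetical.

```python
import hashlib


def make_chunk_id(doc_id, chunk_index, text):
    """Derive a deterministic ID from the document, position, and text."""
    payload = f"{doc_id}:{chunk_index}:{text}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"{doc_id}_{chunk_index:04d}_{digest[:12]}"


chunk_text = "Revenue grew 12% year over year."
chunk_id = make_chunk_id("doc42", 3, chunk_text)

# Every store keys its own representation by the same chunk_id.
vector_rows = {chunk_id: [0.1, 0.2, 0.3]}       # placeholder embedding
bm25_rows = {chunk_id: chunk_text.lower().split()}  # placeholder tokens
print(chunk_id)
```

Because the same key appears in every store, results from vector search and BM25 can be merged or deduplicated with a plain dictionary lookup.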

### Why Are Tables NOT Vectorized?
- Embeddings don't capture tabular structure (rows, columns, headers)
- Numerical precision is lost in embedding
- SQL enables precise queries (filters, sums, comparisons)
- LLMs read tables more reliably in a structured form such as Markdown
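The point about precise queries can be shown with the standard-library `sqlite3` module. The table name, columns, and figures below are made up for illustration and do not reflect the table store's real schema.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segment_revenue (segment TEXT, fiscal_year INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO segment_revenue VALUES (?, ?, ?)",
    [
        ("Cloud", 2022, 1200.0),
        ("Cloud", 2023, 1350.5),
        ("Devices", 2023, 850.25),
    ],
)

# An exact aggregate that similarity search over embeddings cannot provide.
total_2023 = conn.execute(
    "SELECT SUM(revenue) FROM segment_revenue WHERE fiscal_year = ?",
    (2023,),
).fetchone()[0]
conn.close()
print(total_2023)
```

The query returns an exact sum over exactly the rows that match the filter, whereas a nearest-neighbour lookup would at best return text that resembles the question.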

### Why Is Neo4j MANDATORY?
- Captures entity relationships
- Enables graph-based retrieval
- Represents document structure
- Links knowledge across documents