# Notebook 2: Document Processing & Search Systems

**Learning Objectives:**
- Process and chunk real financial documents effectively
- Build similarity search from scratch using embeddings
- Create a complete document retrieval system
- Handle large document collections with optimization strategies
- Prepare foundation for Day 2's RAG pipeline

**Duration:** 90 minutes

**Prerequisites:** Complete Notebook 1 (Embeddings & Similarity Concepts)

---

## Setup and Imports

Let's start by importing our dependencies and redefining the similarity and embedding helpers from Notebook 1.

In [None]:
# Import required libraries
import numpy as np
import json
import os
import re
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
from datetime import datetime
import openai
from pathlib import Path

# Load environment variables
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("Install python-dotenv for .env support: pip install python-dotenv")

# Set up OpenAI client
client = openai.OpenAI()

# Load our similarity functions from Notebook 1
def cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
    """Calculate cosine similarity between two vectors"""
    a = np.array(vec1)
    b = np.array(vec2)
    dot_product = np.dot(a, b)
    magnitude_a = np.linalg.norm(a)
    magnitude_b = np.linalg.norm(b)
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0
    return float(dot_product / (magnitude_a * magnitude_b))

def get_embedding(text: str, model: str = "text-embedding-3-small") -> Optional[List[float]]:
    """Get embedding for text using OpenAI API (returns None if the request fails)"""
    try:
        response = client.embeddings.create(model=model, input=text)
        return response.data[0].embedding
    except Exception as e:
        print(f"Error getting embedding: {e}")
        return None

print("✅ Setup complete! Ready to process documents.")

---

## Section 1: Document Chunking Strategies (25 minutes)

### 1.1 Why Chunking Matters (5 minutes)

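Before we dive into implementation, let's understand why document chunking is crucial for effective retrieval systems. Embedding models accept only a limited amount of text per input, and a single vector for a whole document blends many topics together, so we split each document into smaller passages that can be embedded and matched against a query individually.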

In [None]:
# Load comprehensive sample financial documents from external files
docs_dir = Path("sample_financial_docs")

# Verify the documents directory exists
if not docs_dir.exists():
    print(f"❌ Documents directory not found: {docs_dir}")
    print("Please ensure the sample_financial_docs folder exists with the required files.")
else:
    # List available documents
    available_docs = list(docs_dir.glob("*.txt"))
    print(f"📁 Found {len(available_docs)} financial documents:")
    
    for doc_path in available_docs:
        # Get file size for reference
        file_size = doc_path.stat().st_size
        file_size_kb = file_size / 1024
        
        print(f"   📄 {doc_path.name} ({file_size_kb:.1f} KB)")
    
    # Provide overview of document types
    print(f"\n📋 Document Types Available:")
    print(f"   • Earnings Call Transcript (Q3 2024)")
    print(f"   • 10-K Risk Factors Filing")
    print(f"   • Goldman Sachs Analyst Report")
    print(f"   • CEO Annual Shareholder Letter")
    print(f"   • Enterprise Software Market Research Report")
    
    print(f"\n✅ Ready to process comprehensive financial documents!")
    print(f"   Total estimated chunks: 50-65 (vs ~10 with short samples)")
    print(f"   Perfect for demonstrating chunking and search differences")

In [None]:
class DocumentChunker:
    """Handles different document chunking strategies"""
    
    @staticmethod
    def chunk_by_characters(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
        """
        Split text into fixed-size character chunks with optional overlap
        
        Pros: Simple, predictable chunk sizes
        Cons: May break sentences/paragraphs awkwardly
        """
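        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        
        chunks = []
        start = 0
        
        while start < len(text):
            end = start + chunk_size
            chunk = text[start:end]
            
            # Clean up the chunk
            chunk = chunk.strip()
            if chunk:
                chunks.append(chunk)
            
            # Stop once the window has reached the end of the text;
            # otherwise the tail would be emitted again as a tiny duplicate chunk
            if end >= len(text):
                break
            
            # Move start position with overlap
            start = end - overlap
            
        return chunks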
    
    @staticmethod
    def chunk_by_sentences(text: str, max_chunk_size: int = 500) -> List[str]:
        """
        Split text by sentences, grouping them to stay under max_chunk_size
        
        Pros: Preserves sentence boundaries, better semantic coherence
        Cons: Variable chunk sizes, complex sentence detection
        """
        # Simple sentence splitting (can be improved with nltk/spacy)
        sentences = re.split(r'(?<=[.!?])\s+', text)
        sentences = [s.strip() for s in sentences if s.strip()]
        
        chunks = []
        current_chunk = ""
        
        for sentence in sentences:
            # Check if adding this sentence would exceed chunk size
            potential_chunk = current_chunk + " " + sentence if current_chunk else sentence
            
            if len(potential_chunk) <= max_chunk_size:
                current_chunk = potential_chunk
            else:
                # Save current chunk and start new one
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = sentence
        
        # Don't forget the last chunk
        if current_chunk:
            chunks.append(current_chunk.strip())
            
        return chunks
    
    @staticmethod
    def chunk_by_paragraphs(text: str, max_chunk_size: int = 1000) -> List[str]:
        """
        Split text by paragraphs, combining small ones
        
        Pros: Preserves natural document structure
        Cons: Very variable chunk sizes, may be too large/small
        """
        # Split by double newlines (paragraphs)
        paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
        
        chunks = []
        current_chunk = ""
        
        for paragraph in paragraphs:
            potential_chunk = current_chunk + "\n\n" + paragraph if current_chunk else paragraph
            
            if len(potential_chunk) <= max_chunk_size:
                current_chunk = potential_chunk
            else:
                # Save current chunk and start new one
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = paragraph
        
        if current_chunk:
            chunks.append(current_chunk.strip())
            
        return chunks
    
    @staticmethod
    def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
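        """
        Sentence-based chunking where each new chunk starts with the tail of the previous one
        
        chunk_size is measured in characters; overlap is measured in words
        (the last `overlap` words of the previous chunk are carried over).
        
        Pros: Maintains context across chunks, chunks end at sentence boundaries
        Cons: More complex, some content duplication, chunks can exceed chunk_size
        """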
        sentences = re.split(r'(?<=[.!?])\s+', text)
        sentences = [s.strip() for s in sentences if s.strip()]
        
        chunks = []
        current_chunk = ""
        overlap_buffer = ""
        
        for sentence in sentences:
            potential_chunk = current_chunk + " " + sentence if current_chunk else sentence
            
            if len(potential_chunk) <= chunk_size:
                current_chunk = potential_chunk
            else:
                # Save current chunk
                if current_chunk:
                    chunks.append(current_chunk.strip())
                    
                    # Create overlap buffer from end of current chunk
                    words = current_chunk.split()
                    if len(words) > overlap:
                        overlap_buffer = " ".join(words[-overlap:])
                    else:
                        overlap_buffer = current_chunk
                
                # Start new chunk with overlap
                current_chunk = overlap_buffer + " " + sentence if overlap_buffer else sentence
        
        if current_chunk:
            chunks.append(current_chunk.strip())
            
        return chunks

print("✅ DocumentChunker class ready!")
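
As a rough check on the fixed-size strategy, the number of chunks can be estimated before any splitting happens. The sketch below is a standalone illustration rather than part of `DocumentChunker`: `sliding_windows` and `expected_window_count` are hypothetical helper names, and the loop is assumed to stop as soon as a window reaches the end of the text.

```python
import math
from typing import List


def sliding_windows(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Fixed-size character windows; each starts chunk_size - overlap after the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already covers the tail of the text
        start += step
    return windows


def expected_window_count(text_length: int, chunk_size: int = 500, overlap: int = 50) -> int:
    """Closed-form count of the windows produced by sliding_windows."""
    if text_length == 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - overlap
    return math.ceil((text_length - chunk_size) / step) + 1


sample = "x" * 1400
print(len(sliding_windows(sample)), expected_window_count(len(sample)))
```

Because each step advances by `chunk_size - overlap` characters, a larger overlap means more chunks, and therefore more embedding requests, for the same document.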

In [None]:
# Load a sample document for testing chunking strategies
sample_doc_path = docs_dir / "earnings_call_q3_2024.txt"

if sample_doc_path.exists():
    with open(sample_doc_path, 'r', encoding='utf-8') as f:
        sample_text = f.read()
    
    print(f"Loaded document: {sample_doc_path.name}")
    print(f"Document length: {len(sample_text):,} characters")
    print(f"Estimated word count: {len(sample_text.split()):,} words")
    print(f"Document preview:\n{sample_text[:200]}...\n")
    
    # Test different chunking methods
    chunker = DocumentChunker()
    
    methods = [
        ("Characters (500)", lambda: chunker.chunk_by_characters(sample_text, 500)),
        ("Sentences (500)", lambda: chunker.chunk_by_sentences(sample_text, 500)),
        ("Paragraphs (1000)", lambda: chunker.chunk_by_paragraphs(sample_text, 1000)),
        ("With Overlap (500/50)", lambda: chunker.chunk_with_overlap(sample_text, 500, 50))
    ]
    
    results = {}
    
    for method_name, method_func in methods:
        chunks = method_func()
        results[method_name] = chunks
        
        print(f"\n{'='*50}")
        print(f"Method: {method_name}")
        print(f"Number of chunks: {len(chunks)}")
        print(f"Chunk sizes: {[len(chunk) for chunk in chunks[:5]]}{'...' if len(chunks) > 5 else ''}")
        print(f"Average chunk size: {np.mean([len(chunk) for chunk in chunks]):.1f} chars")
        print(f"Size std deviation: {np.std([len(chunk) for chunk in chunks]):.1f} chars")
        
        # Show first chunk as example
        print(f"\nFirst chunk preview:")
        print(f"\"{chunks[0][:150]}...\"")
    
    print("\n✅ Chunking comparison complete!")
    print("\n💡 Notice how different methods create different chunk counts and sizes")
    print("💡 Longer documents show clearer differences between chunking strategies")
    
else:
    print(f"❌ Sample document not found: {sample_doc_path}")
    print("Please ensure the sample_financial_docs directory contains the required files.")

**💡 Analysis Questions:**

1. **Which method preserves meaning best?** Look at how sentences and ideas are split
2. **Which is most consistent?** Compare chunk size variations
3. **Which handles financial terminology well?** Notice how numbers and technical terms are treated
4. **Which would work best for search?** Consider semantic coherence of chunks
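
To make question 3 concrete, the sentence-splitting pattern used by the chunker can be tried on a short financial sentence. This is a minimal sketch; the sample text and the company name "Acme Inc." are invented for illustration.

```python
import re

# Same pattern the sentence-based chunker uses:
# split on whitespace that follows a period, exclamation mark, or question mark
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

text = "Revenue grew 4.5% to $3.2 billion. Acme Inc. raised its full-year guidance."
parts = SENTENCE_BOUNDARY.split(text)

for part in parts:
    print(repr(part))
```

Decimals such as `4.5%` and `$3.2` stay intact because no whitespace follows the period, but the abbreviation `Inc.` is treated as a sentence end, so the second sentence is split in two.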

**Key Insights:**
- **Character chunking**: Fast but may break sentences awkwardly
- **Sentence chunking**: Good balance of speed and coherence
- **Paragraph chunking**: Maintains structure but variable sizes
- **Overlap chunking**: Best context preservation but some redundancy
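
The comparison above can also be put into numbers. The following sketch assumes a hypothetical helper, `chunk_quality`, which reports how consistent chunk sizes are and how often a chunk ends on sentence punctuation. The two toy lists stand in for real chunker output.

```python
import statistics
from typing import Dict, List


def chunk_quality(chunks: List[str]) -> Dict[str, float]:
    """Summarize size consistency and how often chunks end on sentence punctuation."""
    if not chunks:
        return {"count": 0, "mean_size": 0.0, "size_cv": 0.0, "clean_end_ratio": 0.0}
    sizes = [len(c) for c in chunks]
    mean_size = statistics.mean(sizes)
    # Coefficient of variation (std / mean) makes strategies with different target sizes comparable
    size_cv = statistics.pstdev(sizes) / mean_size if mean_size else 0.0
    clean_ends = sum(1 for c in chunks if c.rstrip().endswith((".", "!", "?")))
    return {
        "count": len(chunks),
        "mean_size": mean_size,
        "size_cv": size_cv,
        "clean_end_ratio": clean_ends / len(chunks),
    }


# Toy chunk lists standing in for the output of two strategies
by_characters = ["Revenue grew 12% year over ye", "ar. Margins improved to 31%. ", "Cash flow rem"]
by_sentences = ["Revenue grew 12% year over year.", "Margins improved to 31%.", "Cash flow remained strong."]

print(chunk_quality(by_characters))
print(chunk_quality(by_sentences))
```

In the notebook, each list stored in `results` could be passed to the same function to compare the four strategies on the earnings call transcript.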

---

## Section 2: Document Processing Pipeline (20 minutes)

Now let's build a complete pipeline to process multiple documents and manage their chunks effectively.

In [None]:
@dataclass
class DocumentInfo:
    """Metadata for a document"""
    filename: str
    title: str
    doc_type: str  # earnings_call, 10k_filing, analyst_report, etc.
    date: Optional[str] = None
    source: Optional[str] = None
    word_count: int = 0
    char_count: int = 0

@dataclass
class ChunkInfo:
    """Metadata for a document chunk"""
    chunk_id: str
    document_id: str
    content: str
    chunk_index: int  # Position within document
    char_count: int
    word_count: int
    embedding: Optional[List[float]] = None

class DocumentProcessor:
    """Complete document processing pipeline"""
    
    def __init__(self, chunk_method: str = "sentences", chunk_size: int = 500):
        self.chunk_method = chunk_method
        self.chunk_size = chunk_size
        self.chunker = DocumentChunker()
        self.documents: Dict[str, DocumentInfo] = {}
        self.chunks: Dict[str, ChunkInfo] = {}
        
    def _detect_document_type(self, filename: str, content: str) -> str:
        """Simple document type detection based on filename and content"""
        filename_lower = filename.lower()
        content_lower = content.lower()
        
        if "earnings" in filename_lower or "earnings call" in content_lower:
            return "earnings_call"
        elif "10k" in filename_lower or "risk factors" in content_lower:
            return "10k_filing"
        elif "analyst" in filename_lower or "price target" in content_lower:
            return "analyst_report"
        elif "news" in filename_lower:
            return "news_article"
        else:
            return "unknown"
    
    def _extract_title(self, filename: str, content: str) -> str:
        """Extract title from document content or use filename"""
        lines = content.strip().split('\n')
        first_line = lines[0].strip()
        
        # If first line is short and looks like a title, use it
        if len(first_line) < 100 and len(first_line.split()) < 15:
            return first_line
        else:
            # Use filename without extension as title
            return Path(filename).stem.replace('_', ' ').title()
    
    def _clean_text(self, text: str) -> str:
        """Clean and normalize text content, keeping paragraph breaks"""
        # Remove special characters but keep financial symbols
        # Keep: periods, commas, parentheses, dollar signs, percentages, plus, minus
        text = re.sub(r'[^\w\s.,()$%+-]', ' ', text)
        
        # Collapse whitespace inside each paragraph, but keep blank lines
        # between paragraphs so paragraph chunking still has boundaries
        paragraphs = re.split(r'\n\s*\n', text)
        paragraphs = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
        
        return '\n\n'.join(p for p in paragraphs if p)
    
    def process_document(self, file_path: str) -> str:
        """Process a single document and return document ID"""
        # Read file
        with open(file_path, 'r', encoding='utf-8') as f:
            raw_content = f.read()
        
        # Clean content
        content = self._clean_text(raw_content)
        
        # Create document metadata
        filename = Path(file_path).name
        doc_id = Path(file_path).stem  # Use filename without extension as ID
        
        doc_info = DocumentInfo(
            filename=filename,
            title=self._extract_title(filename, raw_content),  # raw text still has line breaks
            doc_type=self._detect_document_type(filename, content),
            word_count=len(content.split()),
            char_count=len(content)
        )
        
        self.documents[doc_id] = doc_info
        
        # Chunk the document
        if self.chunk_method == "sentences":
            chunks = self.chunker.chunk_by_sentences(content, self.chunk_size)
        elif self.chunk_method == "paragraphs":
            chunks = self.chunker.chunk_by_paragraphs(content, self.chunk_size)
        elif self.chunk_method == "overlap":
            chunks = self.chunker.chunk_with_overlap(content, self.chunk_size, 50)
        else:  # default to characters
            chunks = self.chunker.chunk_by_characters(content, self.chunk_size)
        
        # Create chunk metadata
        for i, chunk_content in enumerate(chunks):
            chunk_id = f"{doc_id}_chunk_{i:03d}"
            
            chunk_info = ChunkInfo(
                chunk_id=chunk_id,
                document_id=doc_id,
                content=chunk_content,
                chunk_index=i,
                char_count=len(chunk_content),
                word_count=len(chunk_content.split())
            )
            
            self.chunks[chunk_id] = chunk_info
        
        print(f"✅ Processed {filename}: {len(chunks)} chunks created")
        return doc_id
    
    def process_directory(self, directory_path: str) -> List[str]:
        """Process all documents in a directory"""
        directory = Path(directory_path)
        doc_ids = []
        
        for file_path in sorted(directory.glob("*.txt")):  # sorted for a reproducible order
            doc_id = self.process_document(str(file_path))
            doc_ids.append(doc_id)
        
        return doc_ids
    
    def get_statistics(self) -> Dict:
        """Get processing statistics"""
        total_chunks = len(self.chunks)
        total_docs = len(self.documents)
        
        chunk_sizes = [chunk.char_count for chunk in self.chunks.values()]
        doc_types = [doc.doc_type for doc in self.documents.values()]
        
        return {
            "total_documents": total_docs,
            "total_chunks": total_chunks,
            "avg_chunks_per_doc": total_chunks / total_docs if total_docs > 0 else 0,
            "chunk_size_stats": {
                "min": min(chunk_sizes) if chunk_sizes else 0,
                "max": max(chunk_sizes) if chunk_sizes else 0,
                "mean": np.mean(chunk_sizes) if chunk_sizes else 0,
                "std": np.std(chunk_sizes) if chunk_sizes else 0
            },
            "document_types": {doc_type: doc_types.count(doc_type) for doc_type in set(doc_types)}
        }

print("✅ DocumentProcessor class ready!")
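
The character filter in `_clean_text` decides which symbols survive cleaning, which matters for financial text. The sketch below applies that filter plus whitespace collapsing in a standalone function so it can be tried on one line of text. `clean_financial_text` is a hypothetical name and the sample sentence is invented.

```python
import re


def clean_financial_text(text: str) -> str:
    """Standalone version of the cleaning steps, for experimenting outside the class."""
    text = re.sub(r'\s+', ' ', text)               # collapse runs of whitespace
    text = re.sub(r'[^\w\s.,()$%+-]', ' ', text)   # keep word characters and . , ( ) $ % + -
    text = re.sub(r' +', ' ', text)                # collapse spaces left by the removals
    return text.strip()


raw = "Revenue rose 12% to $4.2B; EPS: $1.05 (up 8%)!"
cleaned = clean_financial_text(raw)
print(cleaned)
```

Dollar amounts and percentages pass through unchanged, while the semicolon, colon, and exclamation mark are dropped. Since `!` and `?` are removed here, the sentence splitter downstream effectively sees only periods as sentence ends.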

In [None]:
# Test the document processor with our comprehensive financial documents
processor = DocumentProcessor(chunk_method="sentences", chunk_size=400)

# Process all sample documents
print("Processing comprehensive financial documents...\n")

# Get list of document paths  
doc_paths = [str(path) for path in docs_dir.glob("*.txt")]

if doc_paths:
    doc_ids = []
    for doc_path in doc_paths:
        doc_id = processor.process_document(doc_path)
        doc_ids.append(doc_id)
    
    # Display statistics
    stats = processor.get_statistics()
    print(f"\n{'='*50}")
    print("PROCESSING STATISTICS")
    print(f"{'='*50}")
    print(f"Total documents: {stats['total_documents']}")
    print(f"Total chunks: {stats['total_chunks']}")
    print(f"Average chunks per document: {stats['avg_chunks_per_doc']:.1f}")
    print(f"\nChunk size statistics:")
    print(f"  Min: {stats['chunk_size_stats']['min']} chars")
    print(f"  Max: {stats['chunk_size_stats']['max']} chars")
    print(f"  Mean: {stats['chunk_size_stats']['mean']:.1f} chars")
    print(f"  Std: {stats['chunk_size_stats']['std']:.1f} chars")
    print(f"\nDocument types: {stats['document_types']}")
    
    # Show some example chunks
    print(f"\n{'='*50}")
    print("SAMPLE CHUNKS")
    print(f"{'='*50}")
    for i, (chunk_id, chunk_info) in enumerate(list(processor.chunks.items())[:3]):
        print(f"\nChunk {i+1}: {chunk_id}")
        print(f"Document: {processor.documents[chunk_info.document_id].title}")
        print(f"Type: {processor.documents[chunk_info.document_id].doc_type}")
        print(f"Size: {chunk_info.char_count} chars, {chunk_info.word_count} words")
        print(f"Content: \"{chunk_info.content[:150]}...\"")
    
    print(f"\n🎯 Success! {stats['total_chunks']} chunks created from {stats['total_documents']} documents")
    print("Now we have enough chunks to properly demonstrate search capabilities!")
    
else:
    print(f"❌ No documents found in {docs_dir}")
    print("Please ensure the sample_financial_docs directory contains .txt files.")

---

## Section 3: Building Efficient Search (25 minutes)

Now let's create embeddings for our chunks and build a powerful search system.

In [None]:
class DocumentSearcher:
    """Efficient similarity-based document search system"""
    
    def __init__(self, processor: DocumentProcessor):
        self.processor = processor
        self.embeddings_generated = False
        
    def generate_embeddings(self, batch_size: int = 10) -> None:
        """Generate embeddings for all chunks"""
        chunks_to_embed = [chunk for chunk in self.processor.chunks.values() 
                          if chunk.embedding is None]
        
        if not chunks_to_embed:
            print("All chunks already have embeddings!")
            self.embeddings_generated = True
            return
        
        print(f"Generating embeddings for {len(chunks_to_embed)} chunks...")
        
        # Process in batches to avoid rate limits
        for i in range(0, len(chunks_to_embed), batch_size):
            batch = chunks_to_embed[i:i + batch_size]
            print(f"Processing batch {i//batch_size + 1}/{(len(chunks_to_embed)-1)//batch_size + 1}")
            
            # Get embeddings for batch
            texts = [chunk.content for chunk in batch]
            
            try:
                # Send the whole batch in one embeddings request (one API call for several texts)
                response = client.embeddings.create(
                    model="text-embedding-3-small",
                    input=texts
                )
                
                # Assign embeddings to chunks
                for j, embedding_data in enumerate(response.data):
                    batch[j].embedding = embedding_data.embedding
                    
            except Exception as e:
                print(f"Error in batch {i//batch_size + 1}: {e}")
                # Fallback to individual requests
                for chunk in batch:
                    embedding = get_embedding(chunk.content)
                    if embedding:
                        chunk.embedding = embedding
        
        self.embeddings_generated = True
        print("✅ Embedding generation complete!")
    
    def search(self, query: str, top_k: int = 5, min_similarity: float = 0.3, 
               doc_type_filter: Optional[str] = None) -> List[Tuple[ChunkInfo, float]]:
        """Search for relevant chunks using semantic similarity"""
        if not self.embeddings_generated:
            raise ValueError("Embeddings not generated. Call generate_embeddings() first.")
        
        # Get query embedding
        query_embedding = get_embedding(query)
        if not query_embedding:
            return []
        
        # Calculate similarities
        similarities = []
        for chunk in self.processor.chunks.values():
            if chunk.embedding is None:
                continue
            
            # Apply document type filter if specified
            if doc_type_filter:
                doc_type = self.processor.documents[chunk.document_id].doc_type
                if doc_type != doc_type_filter:
                    continue
            
            similarity = cosine_similarity(query_embedding, chunk.embedding)
            
            # Apply minimum similarity filter
            if similarity >= min_similarity:
                similarities.append((chunk, similarity))
        
        # Sort by similarity (highest first) and return top_k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_k]
    
    def batch_search(self, queries: List[str], top_k: int = 3) -> Dict[str, List[Tuple[ChunkInfo, float]]]:
        """Run several searches one after another and collect the results per query"""
        results = {}
        for query in queries:
            results[query] = self.search(query, top_k=top_k)
        return results
    
    def get_search_statistics(self) -> Dict:
        """Get statistics about the search index"""
        embedded_chunks = sum(1 for chunk in self.processor.chunks.values() 
                             if chunk.embedding is not None)
        
        return {
            "total_chunks": len(self.processor.chunks),
            "embedded_chunks": embedded_chunks,
            "embedding_coverage": embedded_chunks / len(self.processor.chunks) if self.processor.chunks else 0,
            "searchable_documents": len(self.processor.documents)
        }

print("✅ DocumentSearcher class ready!")
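
The `search` method scores chunks one at a time in a Python loop, which is fine for a few dozen chunks. For larger collections, one common optimization is to stack the embeddings into a matrix and score them all at once. The sketch below is a standalone illustration under that assumption; `top_k_cosine` is a hypothetical helper, and the tiny 3-dimensional vectors stand in for real embeddings.

```python
import numpy as np
from typing import List, Tuple


def top_k_cosine(query: List[float], embeddings: List[List[float]], top_k: int = 5,
                 min_similarity: float = 0.3) -> List[Tuple[int, float]]:
    """Return (row_index, similarity) pairs for the best matches, highest first."""
    matrix = np.asarray(embeddings, dtype=float)
    q = np.asarray(query, dtype=float)
    if matrix.size == 0:
        return []
    # One norm per row and one for the query, then a single matrix-vector product
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(q)
    # Zero vectors get similarity 0 instead of triggering a division by zero
    sims = np.divide(matrix @ q, denom, out=np.zeros(len(matrix)), where=denom > 0)
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order if sims[i] >= min_similarity]


# Toy vectors standing in for chunk embeddings
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
matches = top_k_cosine([1.0, 0.0, 0.0], toy_embeddings, top_k=2)
print(matches)
```

The returned row indices would need to be mapped back to chunk IDs, for example by building the matrix from `processor.chunks.values()` in a fixed order.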

In [None]:
# Create searcher and generate embeddings
searcher = DocumentSearcher(processor)

print("Generating embeddings for search index...")
searcher.generate_embeddings(batch_size=5)  # Small batch size for demo

# Display search statistics
search_stats = searcher.get_search_statistics()
print(f"\n{'='*50}")
print("SEARCH INDEX STATISTICS")
print(f"{'='*50}")
print(f"Total chunks: {search_stats['total_chunks']}")
print(f"Embedded chunks: {search_stats['embedded_chunks']}")
print(f"Embedding coverage: {search_stats['embedding_coverage']:.1%}")
print(f"Searchable documents: {search_stats['searchable_documents']}")
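
When a batch request fails, `generate_embeddings` falls back to individual requests. Another option for transient failures such as rate limits is to retry with exponential backoff. The sketch below shows the idea with a hypothetical `with_retries` helper and a simulated flaky request; it makes no assumptions about specific OpenAI error classes.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], attempts: int = 4, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Run call(); after an exception wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:  # a real client would catch its specific rate-limit error
            if attempt == attempts - 1:
                print(f"Giving up after {attempts} attempts: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for an embedding request that fails twice before succeeding
calls = {"n": 0}


def flaky_request() -> list:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return [0.1, 0.2, 0.3]


waits = []  # record the delays instead of actually sleeping
result = with_retries(flaky_request, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

In the notebook, the wrapped call could be `lambda: client.embeddings.create(model="text-embedding-3-small", input=texts)`, leaving the rest of the batch loop unchanged.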

### Testing Our Search System

Let's test our search system with various financial queries to see how well it retrieves relevant information.

In [None]:
# Define test queries
test_queries = [
    "revenue growth and financial performance",
    "risk factors and cybersecurity threats",
    "cloud services and technology business",
    "cash flow and liquidity position",
    "market competition and competitive risks"
]

print("🔍 TESTING SEARCH SYSTEM")
print("="*60)

for i, query in enumerate(test_queries, 1):
    print(f"\nQuery {i}: \"{query}\"")
    print("-" * 40)
    
    results = searcher.search(query, top_k=3, min_similarity=0.2)
    
    if not results:
        print("No relevant results found.")
        continue
    
    for j, (chunk, similarity) in enumerate(results, 1):
        doc_title = processor.documents[chunk.document_id].title
        doc_type = processor.documents[chunk.document_id].doc_type
        
        print(f"  {j}. Score: {similarity:.3f} | {doc_type} | {doc_title}")
        print(f"     \"{chunk.content[:100]}...\"")
        print()

### Advanced Search Features

Let's test some advanced search capabilities like filtering by document type.

In [None]:
# Test document type filtering
print("🎯 TESTING FILTERED SEARCH")
print("="*50)

query = "financial performance and growth"
print(f"Query: \"{query}\"\n")

# Search across all document types
all_results = searcher.search(query, top_k=5)
print("All Documents:")
for i, (chunk, similarity) in enumerate(all_results, 1):
    doc_type = processor.documents[chunk.document_id].doc_type
    doc_title = processor.documents[chunk.document_id].title
    print(f"  {i}. {similarity:.3f} | {doc_type} | {doc_title[:30]}...")

# Search only in earnings calls
earnings_results = searcher.search(query, top_k=3, doc_type_filter="earnings_call")
print(f"\nEarnings Calls Only:")
for i, (chunk, similarity) in enumerate(earnings_results, 1):
    doc_title = processor.documents[chunk.document_id].title
    print(f"  {i}. {similarity:.3f} | {doc_title}")
    print(f"     \"{chunk.content[:80]}...\"")

# Search only in risk factors
risk_results = searcher.search("cybersecurity data privacy", top_k=2, doc_type_filter="10k_filing")
print(f"\n10-K Filings (Risk Factors):")
for i, (chunk, similarity) in enumerate(risk_results, 1):
    doc_title = processor.documents[chunk.document_id].title
    print(f"  {i}. {similarity:.3f} | {doc_title}")
    print(f"     \"{chunk.content[:80]}...\"")

---

## Section 4: Complete Retrieval System (20 minutes)

Let's integrate everything into a comprehensive financial document retrieval system with advanced features.

In [None]:
class FinancialDocumentRetriever:
    """Complete financial document retrieval system"""
    
    def __init__(self, chunk_method: str = "sentences", chunk_size: int = 400):
        self.processor = DocumentProcessor(chunk_method, chunk_size)
        self.searcher = None
        self.index_built = False
        
    def ingest_documents(self, document_paths: List[str]) -> None:
        """Ingest multiple documents and build search index"""
        print("📄 Ingesting documents...")
        
        for doc_path in document_paths:
            self.processor.process_document(doc_path)
        
        # Build search index
        print("\n🔍 Building search index...")
        self.searcher = DocumentSearcher(self.processor)
        self.searcher.generate_embeddings()
        self.index_built = True
        
        stats = self.processor.get_statistics()
        print(f"\n✅ Ingestion complete!")
        print(f"   Documents: {stats['total_documents']}")
        print(f"   Chunks: {stats['total_chunks']}")
        print(f"   Types: {list(stats['document_types'].keys())}")
    
    def search_documents(self, query: str, top_k: int = 5, 
                        doc_type: Optional[str] = None,
                        min_similarity: float = 0.3,
                        include_context: bool = True) -> List[Dict]:
        """Search documents with rich result formatting"""
        if not self.index_built:
            raise ValueError("Index not built. Call ingest_documents() first.")
        
        # Perform search
        raw_results = self.searcher.search(query, top_k, min_similarity, doc_type)
        
        # Format results with rich metadata
        formatted_results = []
        for chunk, similarity in raw_results:
            doc_info = self.processor.documents[chunk.document_id]
            
            result = {
                "similarity_score": similarity,
                "chunk_id": chunk.chunk_id,
                "content": chunk.content,
                "document": {
                    "title": doc_info.title,
                    "type": doc_info.doc_type,
                    "filename": doc_info.filename
                },
                "chunk_info": {
                    "index": chunk.chunk_index,
                    "word_count": chunk.word_count,
                    "char_count": chunk.char_count
                }
            }
            
            # Add context if requested
            if include_context:
                context = self.get_document_context(chunk.chunk_id, context_size=1)
                result["context"] = context
            
            formatted_results.append(result)
        
        return formatted_results
    
    def get_document_context(self, chunk_id: str, context_size: int = 2) -> Dict:
        """Get surrounding chunks for context"""
        if chunk_id not in self.processor.chunks:
            return {}
        
        target_chunk = self.processor.chunks[chunk_id]
        doc_id = target_chunk.document_id
        target_index = target_chunk.chunk_index
        
        # Find chunks from same document
        doc_chunks = [chunk for chunk in self.processor.chunks.values() 
                     if chunk.document_id == doc_id]
        doc_chunks.sort(key=lambda x: x.chunk_index)
        
        # Get context chunks
        context = {
            "before": [],
            "after": []
        }
        
        for chunk in doc_chunks:
            if target_index - context_size <= chunk.chunk_index < target_index:
                context["before"].append({
                    "chunk_id": chunk.chunk_id,
                    "content": chunk.content[:100] + "..."
                })
            elif target_index < chunk.chunk_index <= target_index + context_size:
                context["after"].append({
                    "chunk_id": chunk.chunk_id,
                    "content": chunk.content[:100] + "..."
                })
        
        return context
    
    def hybrid_search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Combine semantic and keyword-based search (simplified version)"""
        # Fetch a larger candidate pool than top_k so the keyword boost can
        # promote chunks that semantic ranking alone would have cut off
        candidates = self.search_documents(query, top_k=top_k * 3, include_context=False)
        
        # Simple keyword boost: tokenize on word characters so trailing
        # punctuation (e.g. "growth?") does not block a match
        query_words = set(re.findall(r"\w+", query.lower()))
        
        for result in candidates:
            content_words = set(re.findall(r"\w+", result["content"].lower()))
            keyword_overlap = len(query_words.intersection(content_words))
            
            # Boost score by 0.1 per shared word
            keyword_boost = keyword_overlap * 0.1
            result["hybrid_score"] = result["similarity_score"] + keyword_boost
            result["keyword_matches"] = keyword_overlap
        
        # Re-sort by hybrid score and keep the top_k best
        candidates.sort(key=lambda x: x["hybrid_score"], reverse=True)
        
        return candidates[:top_k]
    
    def get_system_summary(self) -> Dict:
        """Get comprehensive system summary"""
        if not self.index_built:
            return {"status": "Index not built"}
        
        doc_stats = self.processor.get_statistics()
        search_stats = self.searcher.get_search_statistics()
        
        return {
            "status": "Ready",
            "documents": doc_stats,
            "search_index": search_stats,
            "capabilities": [
                "Semantic search",
                "Document type filtering",
                "Context retrieval",
                "Hybrid search",
                "Similarity scoring"
            ]
        }

print("✅ FinancialDocumentRetriever class ready!")

### Comprehensive System Test

Let's test our complete retrieval system with real-world financial queries.

In [None]:
# Create and test the complete system
retriever = FinancialDocumentRetriever(chunk_method="sentences", chunk_size=400)

# Get list of document paths
doc_paths = sorted(str(path) for path in Path("sample_docs").glob("*.txt"))  # sorted for a reproducible ingestion order

# Ingest documents
retriever.ingest_documents(doc_paths)

# Get system summary
summary = retriever.get_system_summary()
print(f"\n{'='*60}")
print("FINANCIAL DOCUMENT RETRIEVAL SYSTEM")
print(f"{'='*60}")
print(f"Status: {summary['status']}")
print(f"Documents: {summary['documents']['total_documents']}")
print(f"Chunks: {summary['documents']['total_chunks']}")
print(f"Document Types: {list(summary['documents']['document_types'].keys())}")
print(f"Capabilities: {', '.join(summary['capabilities'])}")

### Real-World Query Testing

Let's test with realistic financial analysis queries:

In [None]:
# Test real-world financial queries
financial_queries = [
    "What does the company say about revenue growth?",
    "Find information about debt and liabilities",
    "Show risk factors mentioned in filings",
    "What are the analysts saying about the stock price?",
    "How is the cloud business performing?"
]

print("\n💼 REAL-WORLD FINANCIAL QUERIES")
print("="*60)

for i, query in enumerate(financial_queries, 1):
    print(f"\n🔍 Query {i}: {query}")
    print("-" * 50)
    
    results = retriever.search_documents(query, top_k=2, min_similarity=0.2)
    
    if not results:
        print("   No relevant results found.")
        continue
    
    for j, result in enumerate(results, 1):
        print(f"\n   Result {j}: Score {result['similarity_score']:.3f}")
        print(f"   📄 {result['document']['type']} | {result['document']['title']}")
        print(f"   💬 \"{result['content'][:120]}...\"")
        
        # Show context if available (get_document_context may return an empty dict)
        context_before = result.get('context', {}).get('before', [])
        if context_before:
            print(f"   📝 Context: ...{context_before[-1]['content']}")
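
The context line printed with each result comes from neighboring chunks of the same document, selected by chunk index. The selection rule can be sketched in isolation as follows; `neighbor_window` is a hypothetical helper written for this illustration, not part of the classes above, and it works on bare indices rather than chunk objects.

```python
from typing import Dict, List


def neighbor_window(chunk_indices: List[int], target_index: int,
                    context_size: int = 1) -> Dict[str, List[int]]:
    """Return the chunk indices within context_size positions of the target."""
    ordered = sorted(chunk_indices)
    # Chunks immediately preceding the target, nearest one last
    before = [i for i in ordered if target_index - context_size <= i < target_index]
    # Chunks immediately following the target, nearest one first
    after = [i for i in ordered if target_index < i <= target_index + context_size]
    return {"before": before, "after": after}


# A five-chunk document: the middle chunk has one neighbor on each side
middle = neighbor_window([0, 1, 2, 3, 4], target_index=2, context_size=1)
print(middle)  # {'before': [1], 'after': [3]}

# The first chunk has no "before" context, so that list is empty
first = neighbor_window([0, 1, 2, 3, 4], target_index=0, context_size=2)
print(first)  # {'before': [], 'after': [1, 2]}
```

This is why the query loop checks that the `before` list is non-empty before printing: a hit on the first chunk of a document has nothing preceding it.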

### Testing Hybrid Search

Let's compare pure semantic search with hybrid search, which adds a keyword-overlap boost to the semantic similarity score:

In [None]:
# Compare semantic vs hybrid search
test_query = "cloud revenue growth performance"

print(f"\n🔬 SEMANTIC vs HYBRID SEARCH COMPARISON")
print(f"Query: \"{test_query}\"")
print("="*60)

# Semantic search
semantic_results = retriever.search_documents(test_query, top_k=3, include_context=False)
print("\n📊 SEMANTIC SEARCH:")
for i, result in enumerate(semantic_results, 1):
    print(f"  {i}. Score: {result['similarity_score']:.3f} | {result['document']['type']}")
    print(f"     \"{result['content'][:100]}...\"\n")

# Hybrid search
hybrid_results = retriever.hybrid_search(test_query, top_k=3)
print("\n🔀 HYBRID SEARCH (Semantic + Keywords):")
for i, result in enumerate(hybrid_results, 1):
    print(f"  {i}. Hybrid Score: {result['hybrid_score']:.3f} (Semantic: {result['similarity_score']:.3f}, Keywords: {result['keyword_matches']})")
    print(f"     {result['document']['type']} | \"{result['content'][:100]}...\"\n")

print("💡 Notice how hybrid search can rerank results based on keyword matches!")
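
To see the reranking effect without calling the embedding API, here is a minimal sketch of the same additive boost (0.1 per shared word) applied to toy candidates. The function name `keyword_boosted_rank`, the sample texts, and the similarity scores are all made up for illustration, and the sketch tokenizes on whitespace for brevity.

```python
from typing import Dict, List


def keyword_boosted_rank(query: str, candidates: List[Dict],
                         boost_per_match: float = 0.1) -> List[Dict]:
    """Rerank candidates by semantic score plus a fixed boost per shared word."""
    query_words = set(query.lower().split())
    ranked = []
    for candidate in candidates:
        content_words = set(candidate["content"].lower().split())
        matches = len(query_words & content_words)
        ranked.append({
            **candidate,
            "keyword_matches": matches,
            "hybrid_score": candidate["similarity_score"] + boost_per_match * matches,
        })
    # Highest combined score first
    ranked.sort(key=lambda r: r["hybrid_score"], reverse=True)
    return ranked


# Toy candidates with invented similarity scores (not real model output)
toy_candidates = [
    {"content": "Operating margin improved during the quarter", "similarity_score": 0.62},
    {"content": "Cloud revenue growth accelerated", "similarity_score": 0.55},
]

ranked = keyword_boosted_rank("cloud revenue growth performance", toy_candidates)
for r in ranked:
    print(f"{r['hybrid_score']:.2f} ({r['keyword_matches']} matches): {r['content']}")
```

The second candidate starts with the lower semantic score but shares three query words, so its boost of 0.3 moves it to the top. Because the boost is unbounded, a long query can let keyword overlap dominate the semantic score; capping or normalizing the boost is a common refinement.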

### Saving Our Work for Day 2

Let's save our processed documents and embeddings for use in tomorrow's RAG implementation:

In [None]:
# Save document processing results for Day 2
def save_retrieval_system(retriever: FinancialDocumentRetriever, save_dir: str = "saved_index"):
    """Save the complete retrieval system for reuse"""
    save_path = Path(save_dir)
    save_path.mkdir(parents=True, exist_ok=True)  # parents=True allows nested save_dir paths
    
    # Save document metadata
    documents_data = {}
    for doc_id, doc_info in retriever.processor.documents.items():
        documents_data[doc_id] = {
            "filename": doc_info.filename,
            "title": doc_info.title,
            "doc_type": doc_info.doc_type,
            "word_count": doc_info.word_count,
            "char_count": doc_info.char_count
        }
    
    with open(save_path / "documents.json", 'w') as f:
        json.dump(documents_data, f, indent=2)
    
    # Save chunks and embeddings
    chunks_data = {}
    for chunk_id, chunk_info in retriever.processor.chunks.items():
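        chunks_data[chunk_id] = {
            "document_id": chunk_info.document_id,
            "content": chunk_info.content,
            "chunk_index": chunk_info.chunk_index,
            "char_count": chunk_info.char_count,
            "word_count": chunk_info.word_count,
            # Convert to a plain list so json.dump also works if embeddings are NumPy arrays
            "embedding": (np.asarray(chunk_info.embedding, dtype=float).tolist()
                          if chunk_info.embedding is not None else None)
        }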
    
    # No indent here: with 1,536 floats per embedding, pretty-printing bloats the file
    with open(save_path / "chunks_and_embeddings.json", 'w') as f:
        json.dump(chunks_data, f)
    
    # Save system configuration
    config = {
        "chunk_method": retriever.processor.chunk_method,
        "chunk_size": retriever.processor.chunk_size,
        "total_documents": len(retriever.processor.documents),
        "total_chunks": len(retriever.processor.chunks),
        "embedding_model": "text-embedding-3-small"
    }
    
    with open(save_path / "config.json", 'w') as f:
        json.dump(config, f, indent=2)
    
    print(f"✅ Retrieval system saved to: {save_path.absolute()}")
    print(f"   📄 Documents: {len(documents_data)}")
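    embedded_count = sum(1 for c in chunks_data.values() if c["embedding"] is not None)
    print(f"   🔍 Chunks: {len(chunks_data)} ({embedded_count} with embeddings)")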
    print(f"   📁 Files: documents.json, chunks_and_embeddings.json, config.json")

# Save our system
save_retrieval_system(retriever)

print("\n🎯 NOTEBOOK 2 COMPLETE!")
print("="*50)
print("✅ Document processing pipeline built")
print("✅ Similarity search system implemented")
print("✅ Hybrid search capabilities added")
print("✅ Context retrieval working")
print("✅ System saved for Day 2 RAG implementation")
print("\n📋 Ready for Day 2: Complete RAG Pipeline!")
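
For reference, reading the saved files back could look like the sketch below. It assumes only the three file names written by `save_retrieval_system`; the function name `load_retrieval_index`, the chunk ID format, and the toy data in the round-trip demo are hypothetical, and Day 2 may load the index differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def load_retrieval_index(save_dir: str = "saved_index") -> Tuple[Dict, Dict, Dict]:
    """Load documents, chunks (with embeddings), and config from a saved index."""
    save_path = Path(save_dir)
    with open(save_path / "documents.json") as f:
        documents = json.load(f)
    with open(save_path / "chunks_and_embeddings.json") as f:
        chunks = json.load(f)
    with open(save_path / "config.json") as f:
        config = json.load(f)

    # Sanity check: the chunk count recorded at save time should match the file
    if config.get("total_chunks") != len(chunks):
        raise ValueError("config.json and chunks_and_embeddings.json disagree on chunk count")
    return documents, chunks, config


# Round-trip demo on a throwaway directory with made-up data
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir)
    (tmp_path / "documents.json").write_text(
        json.dumps({"doc_1": {"title": "Toy report", "doc_type": "report"}}))
    (tmp_path / "chunks_and_embeddings.json").write_text(
        json.dumps({"doc_1_chunk_0": {"document_id": "doc_1",
                                      "content": "Toy content",
                                      "embedding": [0.1, 0.2, 0.3]}}))
    (tmp_path / "config.json").write_text(
        json.dumps({"total_chunks": 1, "embedding_model": "text-embedding-3-small"}))
    documents, chunks, config = load_retrieval_index(tmp_dir)

print(f"Loaded {len(documents)} document(s) and {len(chunks)} chunk(s)")
print(f"Embedding model: {config['embedding_model']}")
```

Queries against a reloaded index must be embedded with the model recorded in `config.json`, since vectors from different embedding models are not comparable.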

---

## Summary: What We've Built

In this notebook, we've created a comprehensive document processing and search system:

### **🏗️ Core Components:**
1. **DocumentChunker**: Multiple chunking strategies (character, sentence, paragraph, overlap)
2. **DocumentProcessor**: Complete document ingestion pipeline with metadata
3. **DocumentSearcher**: Similarity-based search with document-type and minimum-similarity filtering
4. **FinancialDocumentRetriever**: End-to-end system adding formatted results, context retrieval, and hybrid search

### **⚡ Key Features:**
- **Flexible Chunking**: Choose optimal strategy for your use case
- **Rich Metadata**: Track document types, sources, and chunk relationships
- **Semantic Search**: Find relevant content by meaning, not just keywords
- **Hybrid Search**: Combine semantic similarity with keyword matching
- **Context Retrieval**: Get surrounding chunks for better understanding
- **Filtering**: Search within specific document types
- **Scalable**: Batch processing and efficient embedding management

### **📊 Performance Insights:**
- **Chunking Trade-offs**: Balance between context and precision
- **Search Quality**: Semantic similarity captures meaning beyond keywords
- **Efficiency**: Batch embedding generation and similarity calculations
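
The batch similarity calculation mentioned above can be sketched with a single matrix-vector product instead of a Python loop over chunks. This is an illustration under assumptions: `batch_cosine_similarity` is a hypothetical helper, not necessarily how `DocumentSearcher` computes its scores, and the two-dimensional vectors are toy data. Zero vectors score 0, matching the `cosine_similarity` function from the setup cell.

```python
import numpy as np


def batch_cosine_similarity(query_vec, chunk_matrix) -> np.ndarray:
    """Cosine similarity between one query vector and every row of a matrix."""
    query = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(chunk_matrix, dtype=float)

    # Denominator: product of the query norm and each row's norm
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)

    # Leave the score at 0 wherever a zero vector would cause division by zero
    scores = np.zeros(matrix.shape[0])
    nonzero = denom > 0
    scores[nonzero] = (matrix[nonzero] @ query) / denom[nonzero]
    return scores


# Toy 2-D "embeddings": identical, orthogonal, diagonal, and zero vectors
toy_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
scores = batch_cosine_similarity([1.0, 0.0], toy_matrix)
top_two = np.argsort(-scores)[:2]  # indices of the two highest scores

print(np.round(scores, 3))
print(top_two)
```

Stacking all chunk embeddings into one array once, at index-build time, lets each query reuse it; the per-query cost is then one matrix-vector product plus a sort.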

### **🚀 Ready for Day 2:**
Our retrieval system provides the foundation for implementing a complete RAG (Retrieval-Augmented Generation) pipeline. We have:
- ✅ Document processing and chunking
- ✅ Embedding generation and storage
- ✅ Similarity search and ranking
- ✅ Context and metadata retrieval

**Next**: We'll combine this retrieval system with LLM generation to create a complete question-answering system that can provide accurate, grounded responses about financial documents!