# Session 3.1: BakeryAI - Document Loading & Text Processing



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QhACqBSy8kc_MDLcKRSczWJKLEDKqsz4?usp=sharing)

## 🎯 Welcome to Session 3: RAG (Retrieval Augmented Generation)

### What is RAG?

**RAG** retrieves relevant passages from an external knowledge base and passes them to the LLM as context, so answers are grounded in your own documents.

**The Problem:**
- LLMs only know what was in their training data (up to a cutoff date)
- They can't access your company documents
- They can hallucinate when asked about specific information they never saw

**The Solution - RAG:**
```
User Question
     ↓
[Search Knowledge Base]
     ↓
[Retrieve Relevant Docs]
     ↓
[LLM + Context] → Answer
```
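
The flow above can be sketched without any LLM call. This is a toy illustration only: the three knowledge-base sentences, the word-overlap scoring, and the function names `tokenize`, `retrieve`, and `build_prompt` are all assumptions made for this example. A real RAG system would use embeddings and a vector store, which are outside the scope of this notebook.

```python
# Toy illustration of the RAG flow: retrieve by word overlap, then build a prompt.
# The knowledge base sentences and helper names are hypothetical.
import re

KNOWLEDGE_BASE = [
    "Full refunds are provided for orders cancelled within 24 hours.",
    "All customer inquiries must be answered within 2 hours during business hours.",
    "The chocolate truffle cake contains dairy, eggs, and gluten.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, docs, k=1):
    """Rank docs by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "Which allergens does the chocolate truffle cake contain?"
top_docs = retrieve(question, KNOWLEDGE_BASE)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The prompt string is what would be sent to the LLM in the final step of the diagram.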

### BakeryAI Knowledge Base:

We'll load these files:
- 📄 `Customer_Service_Policy.txt` - How to handle customer inquiries
- 📄 `Employee_Handbook.txt` - Company policies and procedures
- 📄 `SOP_Hygiene_Food_Safety.txt` - Safety and hygiene standards
- 📄 `cakes.pdf/docx` - Detailed product catalog
- 📄 `cupcakes.pdf/docx` - Cupcake varieties
- 📄 `desserts.pdf/docx` - Other desserts
- 📄 `accessories.pdf/docx` - Bakery accessories

### Today's Goal:

Build the **document ingestion pipeline**:
1. **Load** documents from multiple formats
2. **Clean** and preprocess text
3. **Split** into manageable chunks
4. **Prepare** for embedding

Let's start! 🚀

In [1]:
!pip install -q langchain langchain-openai langchain-community
!pip install -q pypdf docx2txt unstructured
!pip install -q python-dotenv tiktoken

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.0/76.0 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m27.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.7/64.7 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.32.4, but you have requests 2.32.5 which is incompatible.[0m[31m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m13.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m

In [2]:
import os
from pathlib import Path
from dotenv import load_dotenv

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import (
    TextLoader,
    PyPDFLoader,
    Docx2txtLoader,
    DirectoryLoader
)
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)

In [3]:
from google.colab import userdata
import os

# Set OpenAI API key from Google Colab's user environment or default
def set_openai_api_key(default_key: str = "YOUR_API_KEY") -> None:
    """Set the OpenAI API key from Google Colab's user environment or use a default value."""
    #if not (userdata.get("OPENAI_API_KEY") or "OPENAI_API_KEY" in os.environ):
    os.environ["OPENAI_API_KEY"] = userdata.get("MDX_OPENAI_API_KEY") or default_key


set_openai_api_key()
#set_openai_api_key("sk-...")

llm = ChatOpenAI(model="gpt-5-nano")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

print("✅ Environment ready!")

✅ Environment ready!


In [4]:
!git clone https://github.com/IvanReznikov/mdx-langchain-conclave

Cloning into 'mdx-langchain-conclave'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 27 (delta 6), reused 24 (delta 3), pack-reused 0 (from 0)
Receiving objects: 100% (27/27), 240.64 KiB | 2.56 MiB/s, done.
Resolving deltas: 100% (6/6), done.


## 1. Loading Text Files

Start with the simplest format: plain text files.

In [5]:
# Load Customer Service Policy
def load_text_file(file_path):
    """Load a text file and return Document object"""
    try:
        loader = TextLoader(file_path, encoding='utf-8')
        documents = loader.load()
        return documents
    except FileNotFoundError:
        print(f"⚠️  File not found: {file_path}")
        # Create sample document for demo
        from langchain_core.documents import Document
        return [Document(
            page_content="""CUSTOMER SERVICE POLICY

            At BakeryAI, we are committed to providing exceptional customer service.

            1. Response Time: All customer inquiries must be responded to within 2 hours during business hours.

            2. Order Issues: If a customer reports a damaged or incorrect order, immediately offer a replacement or full refund.

            3. Complaints: Listen actively, apologize sincerely, and provide a solution. Escalate to supervisor if needed.

            4. Refund Policy: Full refunds are provided for orders cancelled within 24 hours. After that, store credit is offered.

            5. Special Requests: We accommodate dietary restrictions and custom orders when possible. Confirm feasibility with kitchen team.
            """,
            metadata={"source": "sample_policy.txt"}
        )]

# Load policy document
policy_docs = load_text_file('/content/mdx-langchain-conclave/data/Customer_Service_Policy.txt')

print(f"✅ Loaded {len(policy_docs)} document(s)")
print(f"\n📄 Document Preview:")
print(policy_docs[0].page_content[:300] + "...")
print(f"\n📊 Metadata: {policy_docs[0].metadata}")

✅ Loaded 1 document(s)

📄 Document Preview:
CUSTOMER SERVICE POLICY AND PROCEDURES

Document Reference: CS-POL-2025-01
Effective Date: January 1, 2025
Review Date: December 31, 2025
Department: Customer Service & Sales

═══════════════════════════════════════════════════════════════

OUR CUSTOMER SERVICE PROMISE

At BakeryAI, we are committed...

📊 Metadata: {'source': '/content/mdx-langchain-conclave/data/Customer_Service_Policy.txt'}


In [6]:
# Load all text files
text_files = [
    '/content/mdx-langchain-conclave/data/Customer_Service_Policy.txt',
    '/content/mdx-langchain-conclave/data/Employee_Handbook.txt',
    '/content/mdx-langchain-conclave/data/SOP_Hygiene_Food_Safety.txt'
]

all_docs = []
for file_path in text_files:
    docs = load_text_file(file_path)
    all_docs.extend(docs)
    print(f"✅ Loaded: {file_path} ({len(docs[0].page_content)} characters)")

print(f"\n📚 Total documents loaded: {len(all_docs)}")

✅ Loaded: /content/mdx-langchain-conclave/data/Customer_Service_Policy.txt (8712 characters)
✅ Loaded: /content/mdx-langchain-conclave/data/Employee_Handbook.txt (14067 characters)
✅ Loaded: /content/mdx-langchain-conclave/data/SOP_Hygiene_Food_Safety.txt (4660 characters)

📚 Total documents loaded: 3


## 2. Loading PDF Files

Load product catalogs from PDF files.

In [7]:
def load_pdf_file(file_path):
    """Load a PDF file and return documents"""
    try:
        loader = PyPDFLoader(file_path)
        documents = loader.load()
        return documents
    except FileNotFoundError:
        print(f"⚠️  PDF not found: {file_path}")
        # Create sample PDF content
        from langchain_core.documents import Document
        return [Document(
            page_content="""CAKES CATALOG

            CHOCOLATE TRUFFLE CAKE
            Price: $45.00
            Description: Rich, decadent chocolate cake with Belgian chocolate truffle filling.
            Layers: Three layers of moist chocolate sponge
            Frosting: Dark chocolate ganache
            Serves: 8-10 people
            Allergens: Dairy, Eggs, Gluten

            VANILLA BEAN CAKE
            Price: $40.00
            Description: Classic vanilla cake made with Madagascar vanilla beans.
            Layers: Three layers of vanilla sponge
            Frosting: Vanilla buttercream
            Serves: 8-10 people
            Allergens: Dairy, Eggs, Gluten

            RED VELVET CAKE
            Price: $50.00
            Description: Velvety red cake with cream cheese frosting.
            Layers: Four layers of red velvet sponge
            Frosting: Cream cheese frosting
            Serves: 10-12 people
            Allergens: Dairy, Eggs, Gluten
            """,
            metadata={"source": "sample_cakes.pdf", "page": 0}
        )]

# Load PDF files
pdf_files = [
    '/content/mdx-langchain-conclave/data/cakes.pdf',
    '/content/mdx-langchain-conclave/data/cupcakes.pdf',
    '/content/mdx-langchain-conclave/data/desserts.pdf',
    '/content/mdx-langchain-conclave/data/accessories.pdf'
]

pdf_docs = []
for pdf_path in pdf_files:
    docs = load_pdf_file(pdf_path)
    pdf_docs.extend(docs)
    print(f"✅ Loaded PDF: {pdf_path} ({len(docs)} page(s))")

print(f"\n📄 Total PDF pages: {len(pdf_docs)}")
print(f"\n📖 First page preview:")
print(pdf_docs[0].page_content[:300] + "...")

✅ Loaded PDF: /content/mdx-langchain-conclave/data/cakes.pdf (8 page(s))
✅ Loaded PDF: /content/mdx-langchain-conclave/data/cupcakes.pdf (1 page(s))
✅ Loaded PDF: /content/mdx-langchain-conclave/data/desserts.pdf (1 page(s))
✅ Loaded PDF: /content/mdx-langchain-conclave/data/accessories.pdf (1 page(s))

📄 Total PDF pages: 11

📖 First page preview:
Cakes descriptions 
1. Torta della Nonna Amore (G) 
This heartwarming cake brings the soul of Tuscan kitchens into your home. Torta della 
Nonna Amore is inspired by the timeless Italian tradition of Sunday lunches, where love 
is ba...


## 3. Loading DOCX Files

Load Microsoft Word documents.

In [8]:
def load_docx_file(file_path):
    """Load a DOCX file and return documents"""
    try:
        loader = Docx2txtLoader(file_path)
        documents = loader.load()
        return documents
    except FileNotFoundError:
        print(f"⚠️  DOCX not found: {file_path}")
        return []

# Load DOCX files
docx_files = [
    '/content/mdx-langchain-conclave/data/cakes.docx',
    '/content/mdx-langchain-conclave/data/cupcakes.docx',
    '/content/mdx-langchain-conclave/data/desserts.docx',
    '/content/mdx-langchain-conclave/data/accessories.docx'
]

docx_docs = []
for docx_path in docx_files:
    docs = load_docx_file(docx_path)
    if docs:
        docx_docs.extend(docs)
        print(f"✅ Loaded DOCX: {docx_path}")

if docx_docs:
    print(f"\n📝 Total DOCX documents: {len(docx_docs)}")

✅ Loaded DOCX: /content/mdx-langchain-conclave/data/cakes.docx
✅ Loaded DOCX: /content/mdx-langchain-conclave/data/cupcakes.docx
✅ Loaded DOCX: /content/mdx-langchain-conclave/data/desserts.docx
✅ Loaded DOCX: /content/mdx-langchain-conclave/data/accessories.docx

📝 Total DOCX documents: 4


## 4. Batch Loading with DirectoryLoader

Load all files from a directory at once.

In [9]:
def load_directory(directory_path, glob_pattern="**/*.txt"):
    """Load all matching files from a directory"""
    try:
        loader = DirectoryLoader(
            directory_path,
            glob=glob_pattern,
            loader_cls=TextLoader,
            loader_kwargs={'encoding': 'utf-8'}
        )
        documents = loader.load()
        return documents
    except Exception as e:
        print(f"⚠️  Error loading directory: {e}")
        return []

# Load all text files from data directory
print("📂 Loading all files from data/ directory...")
dir_docs = load_directory('/content/mdx-langchain-conclave/data/', '**/*.txt')

if dir_docs:
    print(f"✅ Loaded {len(dir_docs)} documents")
    print("\nFiles loaded:")
    for doc in dir_docs:
        print(f"  - {doc.metadata.get('source', 'unknown')}")
else:
    print("⚠️  No files loaded (directory may not exist)")

📂 Loading all files from data/ directory...
✅ Loaded 3 documents

Files loaded:
  - /content/mdx-langchain-conclave/data/Employee_Handbook.txt
  - /content/mdx-langchain-conclave/data/SOP_Hygiene_Food_Safety.txt
  - /content/mdx-langchain-conclave/data/Customer_Service_Policy.txt
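
The `DirectoryLoader` call above only picks up `.txt` files, because it is configured with `TextLoader`. One possible way to cover PDF and DOCX files as well is to group the files by extension first and then hand each group to the matching loader function from the earlier sections. The sketch below shows only the grouping step, using the standard library; `group_files_by_suffix` is a hypothetical helper, and the demo files are created in a temporary directory.

```python
from collections import defaultdict
from pathlib import Path
import tempfile

def group_files_by_suffix(directory, suffixes=(".txt", ".pdf", ".docx")):
    """Return {suffix: [paths]} for files under directory with a wanted suffix."""
    grouped = defaultdict(list)
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            grouped[path.suffix.lower()].append(path)
    return dict(grouped)

# Demo on a throwaway directory (the file names are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    for name in ["policy.txt", "cakes.pdf", "cakes.docx", "notes.csv"]:
        (Path(tmp) / name).write_text("demo")
    grouped = group_files_by_suffix(tmp)
    counts = {suffix: len(paths) for suffix, paths in grouped.items()}

print(counts)
```

Each group could then be passed to `load_text_file`, `load_pdf_file`, or `load_docx_file` as appropriate.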


## 5. Document Inspection

Understand what we loaded before processing.

In [10]:
# Combine all loaded documents
all_documents = all_docs + pdf_docs + docx_docs

print("📊 DOCUMENT INVENTORY")
print("=" * 70)
print(f"\nTotal documents: {len(all_documents)}")

# Calculate statistics
total_chars = sum(len(doc.page_content) for doc in all_documents)
avg_chars = total_chars / len(all_documents) if all_documents else 0

print(f"Total characters: {total_chars:,}")
print(f"Average per document: {avg_chars:,.0f} characters")

# Group by source
sources = {}
for doc in all_documents:
    source = doc.metadata.get('source', 'unknown')
    sources[source] = sources.get(source, 0) + 1

print("\n📁 Documents by Source:")
for source, count in sources.items():
    print(f"  {source}: {count} document(s)")

📊 DOCUMENT INVENTORY

Total documents: 18
Total characters: 70,254
Average per document: 3,903 characters

📁 Documents by Source:
  /content/mdx-langchain-conclave/data/Customer_Service_Policy.txt: 1 document(s)
  /content/mdx-langchain-conclave/data/Employee_Handbook.txt: 1 document(s)
  /content/mdx-langchain-conclave/data/SOP_Hygiene_Food_Safety.txt: 1 document(s)
  /content/mdx-langchain-conclave/data/cakes.pdf: 8 document(s)
  /content/mdx-langchain-conclave/data/cupcakes.pdf: 1 document(s)
  /content/mdx-langchain-conclave/data/desserts.pdf: 1 document(s)
  /content/mdx-langchain-conclave/data/accessories.pdf: 1 document(s)
  /content/mdx-langchain-conclave/data/cakes.docx: 1 document(s)
  /content/mdx-langchain-conclave/data/cupcakes.docx: 1 document(s)
  /content/mdx-langchain-conclave/data/desserts.docx: 1 document(s)
  /content/mdx-langchain-conclave/data/accessories.docx: 1 document(s)


## 6. Text Cleaning & Preprocessing

In [11]:
import re

def clean_text(text):
    """Clean and normalize text"""
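    # Collapse every run of whitespace, including newlines, into one space.
    # Note: this flattens paragraph breaks, so "\n\n" no longer occurs afterwards.
    text = re.sub(r'\s+', ' ', text)

    # Remove special characters but keep basic punctuation.
    # Note: this also strips symbols such as $, &, and apostrophes.
    text = re.sub(r'[^\w\s.,!?;:\-()\[\]{}]', '', text)

    # Normalize line breaks (a no-op here: the first step already replaced them with spaces)
    text = text.replace('\r\n', '\n').replace('\r', '\n')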

    # Strip leading/trailing whitespace
    text = text.strip()

    return text

# Test cleaning
sample_text = """This    is     a     test
    with    extra   spaces     and
    weird\r\nline\rbreaks!!!"""

print("Before cleaning:")
print(repr(sample_text))
print("\nAfter cleaning:")
print(repr(clean_text(sample_text)))

# Apply to all documents
for doc in all_documents:
    doc.page_content = clean_text(doc.page_content)

print("\n✅ All documents cleaned!")

Before cleaning:
'This    is     a     test  \n    with    extra   spaces     and  \n    weird\r\nline\rbreaks!!!'

After cleaning:
'This is a test with extra spaces and weird line breaks!!!'

✅ All documents cleaned!
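
The cleaner above flattens all line breaks and drops characters such as `$` and `&`, which matters for price lists and for splitters that rely on `"\n\n"`. A gentler variant, sketched below under the assumption that paragraph structure and currency symbols should survive, normalizes line endings first and only collapses spaces and tabs. The name `clean_text_keep_paragraphs` is hypothetical and not part of the notebook.

```python
import re

def clean_text_keep_paragraphs(text):
    """Variant of clean_text that keeps paragraph breaks and symbols (illustrative)."""
    # Normalize line endings first, before any whitespace collapsing
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, but leave newlines alone
    text = re.sub(r"[ \t]+", " ", text)
    # Trim a single leftover space on either side of each newline
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse three or more newlines into one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = "Price:   $45.00  \r\n\r\n\r\nCustomer Service & Sales\rcustomer's   name"
cleaned = clean_text_keep_paragraphs(sample)
print(repr(cleaned))
```

Whether to keep or strip such characters depends on the corpus. For a product catalog with prices, keeping them is probably the safer default.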


## 7. Text Splitting Strategies

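Why split? LLMs and embedding models have limited context windows, and retrieval works at the chunk level. Smaller, focused chunks let the retriever return only the passages relevant to a question.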
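
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window splitter in plain Python. It is a simplified sketch, not how the LangChain splitters work internally (they split on separators and then merge pieces). The function name `split_with_overlap` is an assumption for this example.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character windows; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role overlap plays in keeping context across chunk boundaries.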

In [12]:
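# Strategy 1: Character-based splitting
# Note: clean_text() above collapsed all newlines, so the "\n\n" separator never
# matches and the document comes back as a single oversized chunk.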
char_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=500,
    chunk_overlap=50,
    length_function=len
)

# Test on a sample document
sample_doc = all_documents[0]
char_chunks = char_splitter.split_documents([sample_doc])

print("📏 CHARACTER-BASED SPLITTING")
print("=" * 70)
print(f"Original document: {len(sample_doc.page_content)} characters")
print(f"Number of chunks: {len(char_chunks)}")
print(f"\nFirst chunk:")
print(char_chunks[0].page_content)
print(f"\nChunk size: {len(char_chunks[0].page_content)} characters")

📏 CHARACTER-BASED SPLITTING
Original document: 7555 characters
Number of chunks: 1

First chunk:
CUSTOMER SERVICE POLICY AND PROCEDURES Document Reference: CS-POL-2025-01 Effective Date: January 1, 2025 Review Date: December 31, 2025 Department: Customer Service  Sales  OUR CUSTOMER SERVICE PROMISE At BakeryAI, we are committed to providing exceptional service, high-quality products, and memorable experiences. Every customer interaction is an opportunity to build lasting relationships and exceed expectations.  CORE SERVICE STANDARDS: 1. GREETING AND ENGAGEMENT  Greet every customer within 30 seconds of entry  Make eye contact and smile warmly  Use customers name when known  Offer assistance proactively  Sample offering: Provide complimentary samples of new products 2. PRODUCT KNOWLEDGE  Know ingredients, allergens, and preparation methods for all items  Provide accurate pricing and availability information  Suggest complementary products  Inform customers about daily specials and promo

In [13]:
# Strategy 2: Recursive splitting (recommended)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

recursive_chunks = recursive_splitter.split_documents([sample_doc])

print("\n🔄 RECURSIVE SPLITTING")
print("=" * 70)
print(f"Number of chunks: {len(recursive_chunks)}")
print(f"\nFirst chunk:")
print(recursive_chunks[0].page_content[:300] + "...")

print("\n📊 Chunk size distribution:")
for i, chunk in enumerate(recursive_chunks[:5]):
    print(f"  Chunk {i+1}: {len(chunk.page_content)} characters")


🔄 RECURSIVE SPLITTING
Number of chunks: 11

First chunk:
CUSTOMER SERVICE POLICY AND PROCEDURES Document Reference: CS-POL-2025-01 Effective Date: January 1, 2025 Review Date: December 31, 2025 Department: Customer Service  Sales  OUR CUSTOMER SERVICE PROMISE At BakeryAI, we are committed to providing exceptional service, high-quality products, and memora...

📊 Chunk size distribution:
  Chunk 1: 994 characters
  Chunk 2: 882 characters
  Chunk 3: 999 characters
  Chunk 4: 999 characters
  Chunk 5: 495 characters


In [14]:
# Strategy 3: Token-based splitting (sizes are measured in tokens, the unit LLM limits use)
token_splitter = TokenTextSplitter(
    chunk_size=200,  # tokens, not characters
    chunk_overlap=50
)

token_chunks = token_splitter.split_documents([sample_doc])

print("\n🎫 TOKEN-BASED SPLITTING")
print("=" * 70)
print(f"Number of chunks: {len(token_chunks)}")
print(f"\nFirst chunk:")
print(token_chunks[0].page_content[:300] + "...")

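# Token counting
# Note: TokenTextSplitter uses the "gpt2" encoding by default, while this check uses
# "cl100k_base", so the counts below can differ slightly from chunk_size.
import tiktoken
encoder = tiktoken.get_encoding("cl100k_base")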

print("\n📊 Token count verification:")
for i, chunk in enumerate(token_chunks[:3]):
    tokens = encoder.encode(chunk.page_content)
    print(f"  Chunk {i+1}: {len(tokens)} tokens, {len(chunk.page_content)} characters")


🎫 TOKEN-BASED SPLITTING
Number of chunks: 11

First chunk:
CUSTOMER SERVICE POLICY AND PROCEDURES Document Reference: CS-POL-2025-01 Effective Date: January 1, 2025 Review Date: December 31, 2025 Department: Customer Service  Sales  OUR CUSTOMER SERVICE PROMISE At BakeryAI, we are committed to providing exceptional service, high-quality products, and memora...

📊 Token count verification:
  Chunk 1: 199 tokens, 987 characters
  Chunk 2: 196 tokens, 1021 characters
  Chunk 3: 191 tokens, 889 characters
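
The token counts above give a rough characters-per-token ratio for this corpus, which helps translate a character-based `chunk_size` into an approximate token budget. The sketch below only does that arithmetic on the three observed pairs. `estimate_tokens` is a hypothetical helper, and the ratio is a heuristic for this text, not a general constant.

```python
# (tokens, characters) pairs copied from the token count verification output above
observed = [(199, 987), (196, 1021), (191, 889)]

total_tokens = sum(tokens for tokens, _ in observed)
total_chars = sum(chars for _, chars in observed)
chars_per_token = total_chars / total_tokens

def estimate_tokens(num_chars, ratio=chars_per_token):
    """Rough token estimate from a character count (heuristic, not a tokenizer)."""
    return round(num_chars / ratio)

print(f"{chars_per_token:.2f} characters per token")
print(f"800 characters is roughly {estimate_tokens(800)} tokens")
```

For exact counts, encode the text with `tiktoken` as in the cell above.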


## 8. Optimal Chunking Strategy for BakeryAI

Choose the best strategy for our use case.

In [15]:
# Recommended: Recursive splitter with moderate chunk size
optimal_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # Good balance
    chunk_overlap=100,   # Maintain context
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Split all documents
all_chunks = optimal_splitter.split_documents(all_documents)

print("✅ OPTIMIZED CHUNKING COMPLETE")
print("=" * 70)
print(f"\nOriginal documents: {len(all_documents)}")
print(f"Final chunks: {len(all_chunks)}")
print(f"Average chunk size: {sum(len(c.page_content) for c in all_chunks) / len(all_chunks):.0f} characters")

# Analyze chunk distribution
chunk_sizes = [len(chunk.page_content) for chunk in all_chunks]
import statistics

print(f"\n📊 Chunk Statistics:")
print(f"  Min size: {min(chunk_sizes)} characters")
print(f"  Max size: {max(chunk_sizes)} characters")
print(f"  Median: {statistics.median(chunk_sizes):.0f} characters")
print(f"  Std dev: {statistics.stdev(chunk_sizes):.0f} characters")

✅ OPTIMIZED CHUNKING COMPLETE

Original documents: 18
Final chunks: 101
Average chunk size: 666 characters

📊 Chunk Statistics:
  Min size: 54 characters
  Max size: 800 characters
  Median: 744 characters
  Std dev: 187 characters


## 9. Metadata Enhancement

Add useful metadata to chunks for better retrieval.

In [16]:
def enhance_metadata(chunks):
    """Add useful metadata to chunks"""

    for i, chunk in enumerate(chunks):
        # Add chunk ID
        chunk.metadata['chunk_id'] = i

        # Add chunk size
        chunk.metadata['chunk_size'] = len(chunk.page_content)

        # Categorize by source type
        source = chunk.metadata.get('source', '')
        if 'policy' in source.lower():
            chunk.metadata['category'] = 'policy'
        elif 'handbook' in source.lower():
            chunk.metadata['category'] = 'handbook'
        elif 'sop' in source.lower() or 'hygiene' in source.lower():
            chunk.metadata['category'] = 'safety'
        # 'cake' also matches 'cupcake'; 'accessor' covers the accessories catalog
        elif any(name in source.lower() for name in ('cake', 'dessert', 'accessor')):
            chunk.metadata['category'] = 'product'
        else:
            chunk.metadata['category'] = 'general'

        # Extract keywords (simple approach)
        keywords = []
        text_lower = chunk.page_content.lower()

        bakery_keywords = ['cake', 'chocolate', 'vanilla', 'order', 'delivery',
                          'refund', 'policy', 'customer', 'price', 'allergen']

        for keyword in bakery_keywords:
            if keyword in text_lower:
                keywords.append(keyword)

        chunk.metadata['keywords'] = keywords

    return chunks

# Enhance all chunks
all_chunks = enhance_metadata(all_chunks)

print("✅ Metadata enhanced!")
print("\n📋 Sample chunk metadata:")
print(all_chunks[0].metadata)

✅ Metadata enhanced!

📋 Sample chunk metadata:
{'source': '/content/mdx-langchain-conclave/data/Customer_Service_Policy.txt', 'chunk_id': 0, 'chunk_size': 680, 'category': 'policy', 'keywords': ['policy', 'customer']}
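
One way this metadata could be used later is to narrow the candidate chunks before any search. The sketch below uses plain dicts in place of LangChain `Document` objects so that it runs on its own. The sample chunks and the `filter_chunks` helper are assumptions for illustration.

```python
# Plain dicts stand in for LangChain Document objects in this sketch.
chunks = [
    {"page_content": "Full refunds within 24 hours.",
     "metadata": {"category": "policy", "keywords": ["refund", "policy"]}},
    {"page_content": "Chocolate cake, allergens: dairy.",
     "metadata": {"category": "product", "keywords": ["cake", "chocolate", "allergen"]}},
    {"page_content": "Wash hands before handling food.",
     "metadata": {"category": "safety", "keywords": []}},
]

def filter_chunks(chunks, category=None, keyword=None):
    """Keep chunks matching the given category and/or keyword (hypothetical helper)."""
    result = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if category is not None and meta.get("category") != category:
            continue
        if keyword is not None and keyword not in meta.get("keywords", []):
            continue
        result.append(chunk)
    return result

product_chunks = filter_chunks(chunks, category="product")
refund_chunks = filter_chunks(chunks, keyword="refund")
print(len(product_chunks), len(refund_chunks))
```

With real `Document` objects, the same checks would read `chunk.metadata` instead of `chunk["metadata"]`.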


## 10. Saving Processed Documents

In [17]:
import pickle

def save_chunks(chunks, filepath='processed_chunks.pkl'):
    """Save processed chunks for later use"""
    with open(filepath, 'wb') as f:
        pickle.dump(chunks, f)
    print(f"✅ Saved {len(chunks)} chunks to {filepath}")

def load_chunks(filepath='processed_chunks.pkl'):
    """Load previously processed chunks"""
    try:
        with open(filepath, 'rb') as f:
            chunks = pickle.load(f)
        print(f"✅ Loaded {len(chunks)} chunks from {filepath}")
        return chunks
    except FileNotFoundError:
        print(f"⚠️  File not found: {filepath}")
        return []

# Save our processed chunks
save_chunks(all_chunks, 'bakery_knowledge_base.pkl')

# Test loading
loaded = load_chunks('bakery_knowledge_base.pkl')
print(f"\n✅ Verification: Loaded {len(loaded)} chunks")

✅ Saved 101 chunks to bakery_knowledge_base.pkl
✅ Loaded 101 chunks from bakery_knowledge_base.pkl

✅ Verification: Loaded 101 chunks
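
Pickle is convenient, but `pickle.load` can execute arbitrary code, so it should only be used on files you created yourself. A JSON file is a possible alternative when chunks need to be shared or inspected. The sketch below assumes each chunk has been reduced to a dict with `page_content` and `metadata`. The helper names `save_chunks_json` and `load_chunks_json` are hypothetical.

```python
import json
import os
import tempfile

def save_chunks_json(chunks, filepath):
    """Write chunks (dicts with page_content and metadata) to a JSON file."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=2)

def load_chunks_json(filepath):
    """Read chunks back from a JSON file written by save_chunks_json."""
    with open(filepath, "r", encoding="utf-8") as f:
        return json.load(f)

# One made-up chunk, shaped like the enhanced metadata shown earlier
chunks = [
    {"page_content": "Crème brûlée, serves 2.",
     "metadata": {"chunk_id": 0, "category": "product", "keywords": []}},
]

path = os.path.join(tempfile.mkdtemp(), "bakery_knowledge_base.json")
save_chunks_json(chunks, path)
restored = load_chunks_json(path)
print(restored == chunks)
```

With real `Document` objects, each chunk could be converted with `{"page_content": doc.page_content, "metadata": doc.metadata}` before saving and rebuilt with `Document(**item)` after loading, provided the metadata contains only JSON-serializable values.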


## 🎯 Exercise 1: Custom Document Loader

**Task**: Create a custom loader for a new file format:
1. Load markdown (.md) files
2. Parse headers and preserve structure
3. Extract metadata from frontmatter

In [18]:
from langchain_core.documents import Document

class MarkdownLoader:
    def __init__(self, file_path):
        self.file_path = file_path

    def load(self):
        """Load and parse markdown file"""
        # TODO: Implement markdown loading
        # Hint: Parse headers, extract metadata, preserve structure
        pass

# Test your loader
# loader = MarkdownLoader('data/menu.md')
# docs = loader.load()

## 🎯 Exercise 2: Intelligent Chunking

**Task**: Create a smart chunker that:
1. Detects section boundaries
2. Keeps related content together
3. Adapts chunk size based on content type

In [19]:
class SmartChunker:
    def __init__(self, min_size=500, max_size=1500):
        self.min_size = min_size
        self.max_size = max_size

    def split_documents(self, documents):
        """Intelligently split documents"""
        # TODO: Implement smart chunking
        # Consider: section headers, lists, tables
        pass

# Test your chunker

## Summary: What We Built

### ✅ Session 3.1 Achievements:

1. **Document Loading**: Text, PDF, DOCX files
2. **Batch Processing**: Directory loading
3. **Text Cleaning**: Normalization and preprocessing
4. **Splitting Strategies**: Character, recursive, token-based
5. **Metadata Enhancement**: Categories, keywords, IDs
6. **Persistence**: Save/load processed chunks

### 📚 BakeryAI Knowledge Base:

✨ **Policies**: Customer service standards  
✨ **Procedures**: Employee handbook  
✨ **Safety**: Hygiene and food safety SOPs  
✨ **Products**: Detailed catalog information  
✨ **Structured**: Optimized chunks with metadata  

### 🚀 Next: Notebook 3.2

We'll create **embeddings and vector stores**:
- Generate embeddings for all chunks
- Build vector databases (FAISS, Chroma)
- Implement semantic search
- Compare vector store performance