# 🚀 GDPR Compliance Agent - Notebook 1: PDF Processing

## 📋 Table of Contents
1. [Project Overview](#project-overview)
2. [Setup & Imports](#setup-imports)
3. [German PDF Extraction](#german-pdf-extraction)
4. [Text Chunking](#text-chunking)
5. [Chunk Analysis](#chunk-analysis)
6. [Save Results](#save-results)

---

## 🎯 Project Overview

**Goal**: Create a GDPR compliance assistant that can answer questions about data protection guidelines.

**This Notebook Focus**: Process text documents and prepare them for the vector database.

**Key Steps**:
- Load sample GDPR handbook
- Extract text from German PDF
- Split text into manageable chunks
- Prepare for embedding generation

---

## ⚙️ Setup & Imports

*Import required libraries and set up the environment*

In [1]:
# !pip install pypdf

In [2]:
# Cell 1: Setup and Imports
import os
import sys
# from dotenv import load_dotenv

from langchain.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
import tiktoken

import pickle

# Add the project root directory to Python path
sys.path.append(os.path.abspath('..'))

# Helper functions:
from src.embedding_cost_calculator import calculate_embedding_cost, quick_cost

# Load environment variables
# load_dotenv()

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


In [3]:
# Quick test of our cost calculator
test_text = "This is a test of the cost calculator"
tokens, cost = quick_cost(test_text)
print(f"✅ Cost calculator test: {tokens} tokens = ${cost:.8f}")

✅ Cost calculator test: 8 tokens = $0.00000016
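
The cost figure above is simple arithmetic on the token count. The sketch below reproduces it under one assumption: a price of $0.02 per million tokens, which is inferred from the outputs in this notebook and not read from `src.embedding_cost_calculator`. The name `embedding_cost` is hypothetical.

```python
# Assumed price, inferred from this notebook's outputs (8 tokens -> $0.00000016).
PRICE_PER_MILLION_TOKENS = 0.02


def embedding_cost(total_tokens, price_per_million=PRICE_PER_MILLION_TOKENS):
    """Return the estimated embedding cost in USD for a given token count."""
    return total_tokens * price_per_million / 1_000_000


# Same token count as the quick test above.
print(f"8 tokens = ${embedding_cost(8):.8f}")
```

Actual pricing can change, so treat the constant as a placeholder to verify before budgeting.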


----


## 🇩🇪 German PDF Extraction

*Now let's extract text from your actual German PDF*

**What we'll do**:
1. Check if German PDF exists
2. Extract text automatically
3. Process German text chunks


In [4]:
# Cell 3: PDF Extraction with PyPDFLoader
def extract_pdf_with_metadata(pdf_path):
    """Extract text from PDF using PyPDFLoader with enhanced metadata"""
    try:
        print(f"📄 Extracting from: {pdf_path}")
        
        if not os.path.exists(pdf_path):
            print(f"❌ File not found: {pdf_path}")
            print("💡 Please place your ZDH PDF in the 2_data/raw/ folder")
            return None
        
        # Use PyPDFLoader (faster and reliable)
        loader = PyPDFLoader(pdf_path)
        documents = loader.load()
        
        print(f"✅ Successfully extracted {len(documents)} pages")
        
        # Enhanced metadata for better retrieval
        enhanced_docs = []
        for i, doc in enumerate(documents):
            # Define keys to remove from original PDF metadata
            keys_to_remove = ['producer', 'creator']
            
            # Start with original metadata but remove unwanted keys
            clean_metadata = {}
            for key, value in doc.metadata.items():
                if key not in keys_to_remove:
                    clean_metadata[key] = value
            
            # Add our custom metadata
            custom_metadata = {
                "document_type": "zdh_gdpr_handbook",
                "document_name": os.path.basename(pdf_path),
                "language": "german",
                "source": "zdh_handbook",
                "page_number": i + 1,
                "total_pages": len(documents),
                "content_length": len(doc.page_content),
                "content_category": categorize_content(doc.page_content),
                "section_type": identify_section_type(doc.page_content),
            }
            
            # Merge metadata; on key clashes the original PDF metadata wins (e.g. 'source' ends up as the PDF path)
            final_metadata = {**custom_metadata, **clean_metadata}
            
            enhanced_doc = Document(
                page_content=doc.page_content,
                metadata=final_metadata
            )
            enhanced_docs.append(enhanced_doc)
        
        # Show first page as sample
        if enhanced_docs:
            print(f"\n📋 First page sample:")
            print(enhanced_docs[0].page_content[:200] + "...")
            print(f"📊 Metadata: {enhanced_docs[0].metadata}")
        
        return enhanced_docs
        
    except Exception as e:
        print(f"❌ PDF extraction error: {e}")
        return None

def categorize_content(text):
    """Categorize GDPR content for better filtering"""
    text_lower = text.lower()
    
    if any(keyword in text_lower for keyword in ['kunde', 'customer', 'marketing']):
        return "customer_data"
    elif any(keyword in text_lower for keyword in ['mitarbeiter', 'employee', 'personal']):
        return "employee_data"
    elif any(keyword in text_lower for keyword in ['recht', 'law', 'gesetz', 'dsgvo']):
        return "legal_basis"
    elif any(keyword in text_lower for keyword in ['sicherheit', 'security', 'datenschutzverletzung']):
        return "security"
    elif any(keyword in text_lower for keyword in ['speicherung', 'retention', 'aufbewahrung']):
        return "data_retention"
    else:
        return "general"

def identify_section_type(text):
    """Identify section types for better chunking"""
    text = text.strip()
    if len(text) < 200 and any(indicator in text for indicator in ['KAPITEL', 'ARTIKEL', 'SECTION']):
        return "section_header"
    elif len(text) < 100 and text.isupper():
        return "heading"
    else:
        return "content"
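
`categorize_content` returns the first category whose keyword appears anywhere in the lowercased text, so the order of the checks and substring matches both matter. The sketch below is a compact, table-driven restatement of the same rules (the names `CATEGORY_KEYWORDS` and `categorize` are hypothetical) with inputs that show both effects.

```python
# Ordered list of (category, keywords); the first matching category wins.
CATEGORY_KEYWORDS = [
    ("customer_data", ["kunde", "customer", "marketing"]),
    ("employee_data", ["mitarbeiter", "employee", "personal"]),
    ("legal_basis", ["recht", "law", "gesetz", "dsgvo"]),
    ("security", ["sicherheit", "security", "datenschutzverletzung"]),
    ("data_retention", ["speicherung", "retention", "aufbewahrung"]),
]


def categorize(text):
    """Return the first category with a keyword contained in the text."""
    text_lower = text.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text_lower for keyword in keywords):
            return category
    return "general"


# "recht" is a substring of "Datenschutzrecht", so a title page matches legal_basis.
print(categorize("Leitfaden Datenschutzrecht"))
# Both "kunde" and "recht" occur, but customer_data is checked first.
print(categorize("Kundendaten nach geltendem Recht"))
# No keyword occurs, so the fallback applies.
print(categorize("Personenbezogene Daten"))
```

Because "recht" occurs inside many German compound words, a large share of pages is likely to land in `legal_basis`; stricter matching (for example on word boundaries) would be one way to narrow that.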

In [5]:
# Cell 4: Extract German PDF
german_pdf_path = "../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf"
documents = extract_pdf_with_metadata(german_pdf_path)

if not documents:
    print("❌ No documents extracted. Stopping here.")
else:
    print(f"✅ Ready to process {len(documents)} PDF pages")
    
    # Estimate cost for raw PDF using our imported function
    raw_texts = [doc.page_content for doc in documents]
    cost_info = calculate_embedding_cost(raw_texts)

📄 Extracting from: ../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf
✅ Successfully extracted 99 pages

📋 First page sample:
Leitfaden 
Datenschutzrecht 
Was Betriebe zu beachten haben 
 
 
Stand: November 2020 
 
Abteilung Organisation und Recht...
📊 Metadata: {'document_type': 'zdh_gdpr_handbook', 'document_name': 'ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'language': 'german', 'source': '../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'page_number': 1, 'total_pages': 99, 'content_length': 121, 'content_category': 'legal_basis', 'section_type': 'content', 'creationdate': '2020-11-06T11:24:59+01:00', 'author': 'Kasper, Lisa', 'moddate': '2020-11-06T11:24:59+01:00', 'page': 0, 'page_label': '1'}
✅ Ready to process 99 PDF pages
📊 Embedding Cost Calculation
   Model: text-embedding-3-small
   Texts: 99
   Total tokens: 46,778
   Cost: $0.000936
   Avg tokens per text: 472.5
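
The printed metadata shows `source` as the PDF path, not the custom value `zdh_handbook`. That follows from how dictionary unpacking resolves duplicate keys: the dictionary listed last wins. A minimal sketch with made-up values (the path `example.pdf` is a placeholder):

```python
# Illustrative values only; they stand in for the notebook's real metadata.
custom_metadata = {"source": "zdh_handbook", "page_number": 1}
pdf_metadata = {"source": "example.pdf", "page": 0}

# Same order as in extract_pdf_with_metadata: the PDF metadata overrides.
pdf_wins = {**custom_metadata, **pdf_metadata}

# Reversed order: the custom metadata overrides.
custom_wins = {**pdf_metadata, **custom_metadata}

print(pdf_wins["source"])
print(custom_wins["source"])
```

Which order is preferable depends on whether later retrieval should filter on the file path or on the logical source name.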


## ✂️ Text Chunking

*Split the document into smaller pieces for processing*

**Why chunking matters**:
- LLMs have context window limits
- Smaller chunks are easier to search
- Better precision in retrieval

**Parameters we're using**:
- `chunk_size=800`: Balance between context and precision
- `chunk_overlap=120`: Maintain context between chunks
- Smart separators: Prefer natural breaks
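
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers breaking at separators, so its chunks vary in length); the function name `sliding_window_chunks` is hypothetical and only illustrates how the two parameters interact.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=120):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


# A 2,000-character dummy text made of repeating digits.
sample_text = "".join(str(i % 10) for i in range(2000))
sample_chunks = sliding_window_chunks(sample_text)

print([len(chunk) for chunk in sample_chunks])
# The tail of one chunk is repeated at the head of the next.
print(sample_chunks[0][-120:] == sample_chunks[1][:120])
```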

In [6]:
# Cell 5: Text Chunking
def create_optimized_splitter():
    """Create optimized splitter for GDPR legal documents"""
    return RecursiveCharacterTextSplitter(
        chunk_size=800,           # Optimal for legal text precision
        chunk_overlap=120,        # 15% overlap for context
        separators=["\n\n", "\n", ". ", "! ", "? ", "; ", ", ", " ", ""],
        length_function=len,      # Measure chunk size in characters, not tokens
    )

def chunk_documents(documents):
    """Chunk documents with our optimized settings - returns List"""
    text_splitter = create_optimized_splitter()
    chunks = text_splitter.split_documents(documents)  # This is List[Document]

    print("\n🔨 Processing text chunks...")
    
    print(f"✂️ Chunking Results:")
    print(f"   Input documents: {len(documents)}")
    print(f"   Output chunks: {len(chunks)}")
    print(f"   Chunk type: {type(chunks[0])}")  # Should show Document
    
    # Analyze chunk sizes - we're accessing .page_content of Document objects
    chunk_sizes = [len(chunk.page_content) for chunk in chunks]
    avg_size = sum(chunk_sizes) / len(chunk_sizes)
    
    print(f"📊 Chunk Size Analysis:")
    print(f"   Average: {avg_size:.0f} characters")
    print(f"   Range: {min(chunk_sizes)} - {max(chunk_sizes)} characters")
    
    # Add chunk-specific metadata to Document objects
    for i, chunk in enumerate(chunks):
        chunk.metadata.update({
            "chunk_id": i + 1,
            "chunk_size": len(chunk.page_content),  # Accessing Document attribute
            "total_chunks": len(chunks)
        })
    
    return chunks  # Returns List[Document]

# Process the chunks
chunks = chunk_documents(documents)

# Show chunk samples
if chunks:
    print(f"\n📋 Chunk Samples:")
    for i in range(min(3, len(chunks))):
        print(f"Chunk {i+1}: {chunks[i].page_content[:100]}...")
        print(f"Size: {len(chunks[i].page_content)} chars | Category: {chunks[i].metadata.get('content_category', 'N/A')}")
        print("---")


🔨 Processing text chunks...
✂️ Chunking Results:
   Input documents: 99
   Output chunks: 266
   Chunk type: <class 'langchain_core.documents.base.Document'>
📊 Chunk Size Analysis:
   Average: 642 characters
   Range: 7 - 799 characters

📋 Chunk Samples:
Chunk 1: Leitfaden 
Datenschutzrecht 
Was Betriebe zu beachten haben 
 
 
Stand: November 2020 
 
Abteilung O...
Size: 121 chars | Category: legal_basis
---
Chunk 2: Vorwort 
Seit dem 25. Mai 2018 gelten in allen Mitgliedstaaten der Europäischen Union neue Daten-
sc...
Size: 749 chars | Category: legal_basis
---
Chunk 3: pekte und Fragen. Er bietet neben rechtlichen Erklärungen zahlreiche Beispielsfälle, Checklis-
ten u...
Size: 644 chars | Category: legal_basis
---


## 📊 Chunk Analysis

*Examine the results of our chunking strategy*

**What to check**:
- Number of chunks created
- Size distribution
- Content quality

**Common Issues**:
- ❌ Chunks too small (lose context)
- ❌ Chunks too large (irrelevant info)
- ✅ Balanced chunks (optimal retrieval)
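
One way to act on the "chunks too small" issue is to separate very short texts before embedding. The sketch below works on plain strings; the function name `split_by_min_length`, the threshold of 50 characters, and the sample texts are all assumptions for illustration, not values taken from this project.

```python
def split_by_min_length(texts, min_chars=50):
    """Return (kept, too_small) based on the stripped length of each text."""
    kept = [t for t in texts if len(t.strip()) >= min_chars]
    too_small = [t for t in texts if len(t.strip()) < min_chars]
    return kept, too_small


# Invented sample texts, including a 7-character fragment and a blank one.
sample_texts = [
    "Seite 3",
    "Vorwort " + "x" * 100,
    "   ",
    "Stand: November 2020",
]

kept, too_small = split_by_min_length(sample_texts)
print(f"kept: {len(kept)}, too small: {len(too_small)}")
```

With `Document` objects the same idea applies to `chunk.page_content`. Whether short chunks should be dropped or merged into a neighbour depends on what they contain, so inspecting them first is worthwhile.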

In [7]:
# Examine Chunk Distribution
chunk_lengths = [len(chunk.page_content) for chunk in chunks]

print(f"📊 Chunk statistics:")
print(f"Min length: {min(chunk_lengths)}")
print(f"Max length: {max(chunk_lengths)}")
print(f"Avg length: {sum(chunk_lengths)/len(chunk_lengths):.1f}")

📊 Chunk statistics:
Min length: 7
Max length: 799
Avg length: 642.1


In [8]:
type(chunks[1])

langchain_core.documents.base.Document

In [9]:
print(chunks[1].metadata)

{'document_type': 'zdh_gdpr_handbook', 'document_name': 'ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'language': 'german', 'source': '../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'page_number': 2, 'total_pages': 99, 'content_length': 1299, 'content_category': 'legal_basis', 'section_type': 'content', 'creationdate': '2020-11-06T11:24:59+01:00', 'author': 'Kasper, Lisa', 'moddate': '2020-11-06T11:24:59+01:00', 'page': 1, 'page_label': '2', 'chunk_id': 2, 'chunk_size': 749, 'total_chunks': 266}


In [10]:
# Cell 6: Calculate Embedding Costs
def analyze_chunk_costs(chunks):
    """Analyze costs for the chunked documents - chunks is List[Document]"""
    # Extract just the text content for cost calculation
    texts = [chunk.page_content for chunk in chunks]  # Convert to List[str]
    cost_info = calculate_embedding_cost(texts)
    
    print(f"\n📈 Project Cost Summary:")
    print(f"   Total Document chunks: {len(chunks)}")
    print(f"   Chunk object type: {type(chunks[0])}")
    print(f"   Estimated tokens: {cost_info['total_tokens']:,}")
    print(f"   Estimated cost: ${cost_info['total_cost']:.4f}")
    
    return cost_info

chunk_costs = analyze_chunk_costs(chunks)

📊 Embedding Cost Calculation
   Model: text-embedding-3-small
   Texts: 266
   Total tokens: 51,072
   Cost: $0.001021
   Avg tokens per text: 192.0

📈 Project Cost Summary:
   Total Document chunks: 266
   Chunk object type: <class 'langchain_core.documents.base.Document'>
   Estimated tokens: 51,072
   Estimated cost: $0.0010
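
The chunked text needs more tokens (51,072) than the raw pages (46,778). A plausible explanation is that `chunk_overlap` duplicates text at chunk boundaries, though this notebook does not verify that. The sketch below only does the arithmetic on figures reported in the outputs above.

```python
# Figures copied from the outputs in this notebook.
raw_page_tokens = 46_778
chunk_tokens = 51_072
num_chunks = 266
avg_chunk_chars = 642.1

# Relative increase in tokens after chunking.
token_overhead = (chunk_tokens - raw_page_tokens) / raw_page_tokens

# Average tokens per chunk and characters per token.
avg_tokens_per_chunk = chunk_tokens / num_chunks
chars_per_token = avg_chunk_chars / avg_tokens_per_chunk

print(f"Token overhead after chunking: {token_overhead:.1%}")
print(f"Average tokens per chunk: {avg_tokens_per_chunk:.1f}")
print(f"Average characters per token: {chars_per_token:.2f}")
```

The observed overhead is below the nominal 15% overlap (120 of 800 characters), which would be consistent with many chunks ending at page boundaries where no overlap is added.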


In [11]:
print(chunks[0].metadata)
print(chunks[0].page_content)

{'document_type': 'zdh_gdpr_handbook', 'document_name': 'ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'language': 'german', 'source': '../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'page_number': 1, 'total_pages': 99, 'content_length': 121, 'content_category': 'legal_basis', 'section_type': 'content', 'creationdate': '2020-11-06T11:24:59+01:00', 'author': 'Kasper, Lisa', 'moddate': '2020-11-06T11:24:59+01:00', 'page': 0, 'page_label': '1', 'chunk_id': 1, 'chunk_size': 121, 'total_chunks': 266}
Leitfaden 
Datenschutzrecht 
Was Betriebe zu beachten haben 
 
 
Stand: November 2020 
 
Abteilung Organisation und Recht


## 💾 Save Results

*Save processed chunks for the next notebook*

**What we're saving**:
- German text chunks with metadata (`chunks.pkl`)
- Chunking and embedding configuration (`config.pkl`)
- Ready for embedding generation

**Next Steps**:
- Vector database setup in Notebook 2
- Multilingual embedding generation
- Cross-language search testing
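
The chunks are stored as plain dictionaries with `page_content` and `metadata` keys, so a later notebook can read them back with `pickle.load`. The sketch below shows that round trip in a temporary directory with an invented record; how Notebook 2 actually loads the file is not shown in this notebook, so treat this as an assumption.

```python
import os
import pickle
import tempfile

# Invented record with the same shape as the saved chunk dictionaries.
records = [
    {
        "page_content": "Beispieltext",
        "metadata": {"chunk_id": 1, "language": "german"},
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunks.pkl")

    # Write the records, as save_processed_data does for the real chunks.
    with open(path, "wb") as f:
        pickle.dump(records, f)

    # Read them back, as a later notebook might.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == records)
```

Each loaded dictionary can then be turned back into a `Document(page_content=..., metadata=...)` if the downstream code expects LangChain objects. Only unpickle files you created yourself, since `pickle` can execute arbitrary code on load.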

In [12]:
# Cell 7: Save Processed Data
def save_processed_data(chunks, cost_info):
    """Save processed chunks and their cost estimate for the next notebook"""
    os.makedirs("../2_data/processed", exist_ok=True)
    
    # Save chunks as serializable data
    serializable_chunks = []
    for chunk in chunks:
        serializable_chunks.append({
            'page_content': chunk.page_content,
            'metadata': chunk.metadata
        })
    
    with open("../2_data/processed/chunks.pkl", "wb") as f:
        pickle.dump(serializable_chunks, f)
    
    # Save configuration
    config = {
        "index_name": "gdpr-compliance",
        "total_chunks": len(chunks),
        "chunk_size": 800,
        "chunk_overlap": 120,
        "embedding_model": "text-embedding-3-small",
        "total_tokens": cost_info['total_tokens'],
        "estimated_cost": cost_info['total_cost']
    }
    
    with open("../2_data/processed/config.pkl", "wb") as f:
        pickle.dump(config, f)
    
    print(f"💾 Saved {len(chunks)} chunks to ../2_data/processed/chunks.pkl")
    print(f"💾 Saved configuration to ../2_data/processed/config.pkl")
    
    return config

# Save everything, using the chunk-level estimate (chunk_costs)
# because the chunks, not the raw pages, are what gets embedded
if chunks:
    config = save_processed_data(chunks, chunk_costs)

print("\n🎉 Notebook 1 Complete!")
print("="*50)
print("📝 SUMMARY")
print("="*50)
print(f"📄 PDF Pages Processed: {len(documents)}")
print(f"✂️  Chunks Created: {len(chunks)}")
print(f"💰 Estimated Cost: ${chunk_costs['total_cost']:.4f}")
print(f"🔢 Estimated Tokens: {chunk_costs['total_tokens']:,}")
print(f"📊 Avg Chunk Size: {sum(len(c.page_content) for c in chunks)/len(chunks):.0f} chars")
print("="*50)
print("➡️  Next: Run Notebook 2 to upload to Pinecone")
print("   Notebook 2 will require:")
print("   - OPENAI_API_KEY for embeddings")
print("   - PINECONE_API_KEY for vector database")
print("   - PINECONE_ENVIRONMENT for Pinecone setup")

💾 Saved 266 chunks to ../2_data/processed/chunks.pkl
💾 Saved configuration to ../2_data/processed/config.pkl

🎉 Notebook 1 Complete!
📝 SUMMARY
📄 PDF Pages Processed: 99
✂️  Chunks Created: 266
💰 Estimated Cost: $0.0010
🔢 Estimated Tokens: 51,072
📊 Avg Chunk Size: 642 chars
➡️  Next: Run Notebook 2 to upload to Pinecone
   Notebook 2 will require:
   - OPENAI_API_KEY for embeddings
   - PINECONE_API_KEY for vector database
   - PINECONE_ENVIRONMENT for Pinecone setup


-----------


# DRAFTs

## 🔍 Load & Explore Data

*Load our sample data and examine its structure*

**Key Questions**:
- How much text do we have?
- What's the content structure?
- Are there clear sections we can use?

*Understanding your data is crucial for good chunking strategy*