# 🚀 GDPR Compliance Agent - Notebook 1: PDF Processing

## 📋 Table of Contents
1. [Project Overview](#project-overview)
2. [Setup & Imports](#setup-imports)
3. [Create Sample Data](#create-sample-data)
4. [Load & Explore Data](#load-explore-data)
5. [Text Chunking](#text-chunking)
6. [Chunk Analysis](#chunk-analysis)
7. [German PDF Extraction](#german-pdf-extraction)
8. [Save Results](#save-results)

---

## 🎯 Project Overview

**Goal**: Create a GDPR compliance assistant that can answer questions about data protection guidelines.

**This Notebook Focus**: Process text documents and prepare them for the vector database.

**Key Steps**:
- Load sample GDPR handbook
- Extract text from German PDF
- Split text into manageable chunks
- Prepare for embedding generation

---

## ⚙️ Setup & Imports

*Import required libraries and set up the environment*

In [1]:
# !pip install pypdf


In [2]:
# Cell 1: Setup and Imports
import os
import sys
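# The loaders and splitters now live in dedicated packages;
# the old `langchain.*` import paths are deprecated
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter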

import pickle

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## 📄 Create Sample Data

*Since we're starting with English data, we'll create a sample GDPR handbook*

**What we're creating**:
- Basic GDPR principles
- Customer data handling rules  
- Employee data guidelines
- Data breach procedures

*This simulates a real company compliance handbook*

In [3]:

# Cell 2: Create Sample English Data
sample_english_content = """
GDPR COMPLIANCE HANDBOOK FOR SMALL BUSINESSES

SECTION 1: BASIC PRINCIPLES

Article 1: Data Protection Principles
Personal data must be processed lawfully, fairly, and transparently. 
Businesses must clearly state why they collect data and how it will be used.

Article 2: Lawful Basis for Processing
You can process personal data when:
- You have explicit consent from the individual
- It's necessary for a contract
- It's required by law
- It's in the legitimate interests of your business

Article 3: Data Minimization
Only collect data that is strictly necessary for your specific purpose.
Do not collect excessive or irrelevant information.

SECTION 2: CUSTOMER DATA HANDLING

Article 4: Customer Consent
For marketing emails, you must have explicit opt-in consent.
Pre-ticked boxes or assumed consent are not valid.
Customers must be able to withdraw consent easily.

Article 5: Data Retention
Keep customer data only as long as necessary:
- Invoices and contracts: 10 years
- Marketing consent: 2 years (unless renewed)
- Customer complaints: 6 years

Article 6: Data Subject Rights
Customers have the right to:
- Access their personal data
- Correct inaccurate data
- Request deletion of their data
- Object to data processing

SECTION 3: EMPLOYEE DATA

Article 7: Employee Records
Keep employee data secure and confidential:
- Employment contracts: 6 years after employment ends
- Salary records: 10 years
- Performance reviews: 3 years

Article 8: Recruitment Data
Unsuccessful applicant data: 6 months
Interview notes: 12 months

SECTION 4: DATA BREACH PROCEDURES

Article 9: Breach Notification
Report data breaches to authorities within 72 hours.
Inform affected individuals if there is high risk to their rights.
Document all breaches for internal records.

Article 10: Security Measures
Implement appropriate technical security measures.
Train staff on data protection principles.
Regularly review and update security practices.
"""


In [4]:

# Save sample data
os.makedirs("../2_data/raw", exist_ok=True)
with open("../2_data/raw/sample_english_handbook.txt", "w", encoding='utf-8') as f:
    f.write(sample_english_content)

print("✅ Sample English handbook created!")


✅ Sample English handbook created!


## 🔍 Load & Explore Data

*Load our sample data and examine its structure*

**Key Questions**:
- How much text do we have?
- What's the content structure?
- Are there clear sections we can use?

*Understanding your data is crucial for good chunking strategy*

In [5]:

# Cell 3: Load and Explore the Data
loader = TextLoader("../2_data/raw/sample_english_handbook.txt", encoding='utf-8')
documents = loader.load()

print(f"📄 Number of documents: {len(documents)}")
print(f"📝 First 500 characters:")
print(documents[0].page_content[:500] + "...")


📄 Number of documents: 1
📝 First 500 characters:

GDPR COMPLIANCE HANDBOOK FOR SMALL BUSINESSES

SECTION 1: BASIC PRINCIPLES

Article 1: Data Protection Principles
Personal data must be processed lawfully, fairly, and transparently. 
Businesses must clearly state why they collect data and how it will be used.

Article 2: Lawful Basis for Processing
You can process personal data when:
- You have explicit consent from the individual
- It's necessary for a contract
- It's required by law
- It's in the legitimate interests of your business

Articl...
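
One way to answer the question about clear sections is to scan for the heading patterns visible in the preview. The sketch below is a minimal, standard-library example; `sample_text` is a shortened, hypothetical stand-in for `documents[0].page_content`, and the `outline` helper is not part of the notebook.

```python
import re

# Hypothetical excerpt standing in for documents[0].page_content
sample_text = """
GDPR COMPLIANCE HANDBOOK FOR SMALL BUSINESSES

SECTION 1: BASIC PRINCIPLES

Article 1: Data Protection Principles
Personal data must be processed lawfully, fairly, and transparently.

Article 2: Lawful Basis for Processing
You can process personal data when:
- You have explicit consent from the individual

SECTION 2: CUSTOMER DATA HANDLING

Article 4: Customer Consent
For marketing emails, you must have explicit opt-in consent.
"""


def outline(text):
    """Return (section_titles, article_titles) for headings that start a line."""
    sections = re.findall(r"^SECTION \d+: (.+)$", text, flags=re.MULTILINE)
    articles = re.findall(r"^Article \d+: (.+)$", text, flags=re.MULTILINE)
    return sections, articles


sections, articles = outline(sample_text)
print(f"Characters: {len(sample_text)}")
print(f"Sections ({len(sections)}): {sections}")
print(f"Articles ({len(articles)}): {articles}")
```

If the headings are this regular, blank lines between articles are natural split points, which is what a paragraph-first separator list relies on.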


## ✂️ Text Chunking

*Split the document into smaller pieces for processing*

**Why chunking matters**:
- LLMs have context window limits
- Smaller chunks are easier to search
- Better precision in retrieval

**Parameters we're using**:
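- `chunk_size=500`: Maximum chunk length in characters; balances context against precision
- `chunk_overlap=50`: Up to 50 characters shared between neighboring chunks to maintain context
- `separators`: Tried in order, so splits prefer paragraph breaks, then line breaks, then sentence ends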
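
To make the interplay of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windows with overlap. It is a deliberately simplified illustration, not how `RecursiveCharacterTextSplitter` works internally (that splitter also respects separators); the `sliding_window_chunks` helper is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


demo = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears in full context in at least one chunk.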

In [6]:
# Cell 4: Split Text into Chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""]
)

en_sample_chunks = text_splitter.split_documents(documents)

print(f"✂️ Created {len(en_sample_chunks)} text chunks")
print("\n📋 Sample chunk:")
print(f"Content: {en_sample_chunks[2].page_content[:200]}...")
print(f"Length: {len(en_sample_chunks[2].page_content)} characters")


✂️ Created 5 text chunks

📋 Sample chunk:
Content: Article 5: Data Retention
Keep customer data only as long as necessary:
- Invoices and contracts: 10 years
- Marketing consent: 2 years (unless renewed)
- Customer complaints: 6 years

Article 6: Data...
Length: 386 characters
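
The `separators` list is tried in order: coarse separators first, finer ones only for pieces that are still too long. The sketch below illustrates just that fallback idea with a hypothetical `recursive_split` helper. It is a simplification under stated assumptions: unlike the LangChain splitter, it does not merge small neighboring pieces back together and adds no overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the coarsest separator first and recurse into finer
    separators only for pieces longer than chunk_size.

    Simplification: separators are dropped, pieces are not merged,
    and no overlap is added.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Nothing finer left to try: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces


demo_text = (
    "Article 4: Customer Consent\n"
    "For marketing emails, you must have explicit opt-in consent.\n"
    "Customers must be able to withdraw consent easily.\n\n"
    "Article 8: Recruitment Data\n"
    "Unsuccessful applicant data: 6 months"
)
pieces = recursive_split(demo_text, ["\n\n", "\n", " ", ""], chunk_size=80)
for piece in pieces:
    print(repr(piece))
```

Here the first paragraph is longer than 80 characters, so it falls back to line breaks and yields three pieces, while the second paragraph fits and stays together as one piece.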


## 📊 Chunk Analysis

*Examine the results of our chunking strategy*

**What to check**:
- Number of chunks created
- Size distribution
- Content quality

**Common Issues**:
- ❌ Chunks too small: each chunk loses its surrounding context
- ❌ Chunks too large: retrieval returns irrelevant text alongside the answer
- ✅ Goal: balanced chunks that keep one topic together

In [7]:
# Cell 5: Examine Chunk Distribution
chunk_lengths = [len(chunk.page_content) for chunk in en_sample_chunks]

print(f"📊 Chunk statistics:")
print(f"Min length: {min(chunk_lengths)}")
print(f"Max length: {max(chunk_lengths)}")
print(f"Avg length: {sum(chunk_lengths)/len(chunk_lengths):.1f}")

📊 Chunk statistics:
Min length: 338
Max length: 491
Avg length: 399.2
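
Min, max, and average can hide outliers, so it may help to also flag which chunks fall outside an acceptable range. The following is a minimal sketch; the `summarize_lengths` helper, the example lengths, and the `min_ok`/`max_ok` thresholds are all hypothetical and would need tuning for real data.

```python
import statistics


def summarize_lengths(lengths, min_ok, max_ok):
    """Summarize chunk lengths and list indices outside [min_ok, max_ok]."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(statistics.mean(lengths), 1),
        "median": statistics.median(lengths),
        "too_small": [i for i, n in enumerate(lengths) if n < min_ok],
        "too_large": [i for i, n in enumerate(lengths) if n > max_ok],
    }


# Hypothetical lengths; in the notebook this would be `chunk_lengths`
example_lengths = [121, 471, 447, 338, 491]
summary = summarize_lengths(example_lengths, min_ok=200, max_ok=500)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The indices in `too_small` and `too_large` point back into the chunk list, which makes it easy to print the offending chunks and decide whether the splitter settings need adjusting.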


----

## 🇩🇪 German PDF Extraction

*Now let's extract text from your actual German PDF*

**What we'll do**:
1. Check if German PDF exists
2. Extract text automatically
3. Process German text chunks
4. Compare with English version

**Important**: We'll use the same chunking strategy for both languages

In [8]:
# Cell 6: Extract Text from German PDF
def extract_german_pdf(pdf_path):
    """Extract text from German PDF automatically"""
    try:
        print(f"🔍 Attempting to extract text from: {pdf_path}")
        
        # Check if file exists
        if not os.path.exists(pdf_path):
            print(f"❌ File not found: {pdf_path}")
            print("💡 Please place your German PDF in the 2_data/raw/ folder")
            return None
        
        # Load PDF using PyPDFLoader
        loader = PyPDFLoader(pdf_path)
        german_documents = loader.load()
        
        print(f"✅ Successfully extracted {len(german_documents)} pages from German PDF")
        
        # Show sample content from first page
        if german_documents:
            first_page_content = german_documents[0].page_content
            print(f"\n📄 Sample from first page (first 300 characters):")
            print(first_page_content[:300] + "..." if len(first_page_content) > 300 else first_page_content)
            
            # Check if text looks like German
            german_keywords = ['der', 'die', 'das', 'und', 'für', 'von', 'mit', 'Datenschutz', 'DSGVO']
            found_keywords = [word for word in german_keywords if word in first_page_content]
            print(f"\n🔤 German keywords found: {found_keywords}")
        
        return german_documents
        
    except Exception as e:
        print(f"❌ Error extracting PDF: {e}")
        return None


In [9]:

# Try to extract German PDF
german_pdf_path = "../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf"
german_documents = extract_german_pdf(german_pdf_path)

# If no German PDF found, create a sample German text for testing
if german_documents is None:
    print("\n📝 Creating sample German text for testing...")
    
    sample_german_content = """
    Datenschutz-Handbuch für Handwerksbetriebe
    
    Kapitel 1: Grundlagen der DSGVO
    
    Artikel 1: Datenschutzgrundsätze
    Personenbezogene Daten müssen rechtmäßig, nach Treu und Glauben und transparent verarbeitet werden.
    Unternehmen müssen klar angeben, warum sie Daten sammeln und wie sie verwendet werden.
    
    Artikel 2: Rechtsgrundlagen der Verarbeitung
    Sie dürfen personenbezogene Daten verarbeiten, wenn:
    - Die betroffene Person eingewilligt hat
    - Die Verarbeitung für einen Vertrag erforderlich ist
    - Eine gesetzliche Verpflichtung besteht
    - Berechtigte Interessen des Unternehmens vorliegen
    
    Artikel 3: Datenminimierung
    Sammeln Sie nur Daten, die für den spezifischen Zweck unbedingt erforderlich sind.
    Vermeiden Sie übermäßige oder irrelevante Informationen.
    
    Kapitel 2: Umgang mit Kundendaten
    
    Artikel 4: Kundeneinwilligung
    Für Marketing-E-Mails ist eine ausdrückliche Opt-In-Einwilligung erforderlich.
    Vorangekreuzte Kästchen oder stillschweigende Zustimmung sind nicht gültig.
    Kunden müssen ihre Einwilligung jederzeit widerrufen können.
    
    Artikel 5: Aufbewahrungsfristen
    Bewahren Sie Kundendaten nur so lange auf wie nötig:
    - Rechnungen und Verträge: 10 Jahre
    - Marketing-Einwilligungen: 2 Jahre (sofern nicht erneuert)
    - Kundenbeschwerden: 6 Jahre
    """
    
    # Save sample German text
    with open("../2_data/raw/sample_german_handbook.txt", "w", encoding='utf-8') as f:
        f.write(sample_german_content)
    
    # Load the sample German text
    loader = TextLoader("../2_data/raw/sample_german_handbook.txt", encoding='utf-8')
    german_documents = loader.load()
    
    print("✅ Sample German handbook created for testing!")

🔍 Attempting to extract text from: ../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf
✅ Successfully extracted 99 pages from German PDF

📄 Sample from first page (first 300 characters):
Leitfaden 
Datenschutzrecht 
Was Betriebe zu beachten haben 
 
 
Stand: November 2020 
 
Abteilung Organisation und Recht

🔤 German keywords found: ['und', 'Datenschutz']
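
The keyword check in `extract_german_pdf` uses substring matching, so a short word like `der` would also match inside English words such as "modern". A whole-word, case-insensitive variant is sketched below as one possible refinement; the `german_word_hits` helper and its keyword set are hypothetical, and this remains a rough heuristic rather than real language detection.

```python
import re

GERMAN_STOPWORDS = {"der", "die", "das", "und", "für", "von", "mit"}


def german_word_hits(text, keywords=GERMAN_STOPWORDS):
    """Return the sorted keywords that occur as whole words (case-insensitive)."""
    # \w+ is Unicode-aware for str patterns, so umlauts stay inside words
    words = set(re.findall(r"\w+", text.lower()))
    return sorted(words & set(keywords))


print(german_word_hits("Abteilung Organisation und Recht"))   # ['und']
print(german_word_hits("The modern understanding of data"))   # []
```

The second call returns an empty list even though "modern" and "understanding" contain `der` and `und` as substrings.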


In [10]:
# Cell 7: Process German Text Chunks
if german_documents:
    print("\n🔨 Processing German text chunks...")
    
    # Use the same text splitter for consistency
    german_chunks = text_splitter.split_documents(german_documents)
    
    print(f"✂️ Created {len(german_chunks)} German text chunks")
    print(f"📊 German chunk sizes: {[len(chunk.page_content) for chunk in german_chunks[:3]]}...")
    
    # Show sample German chunk
    if german_chunks:
        print(f"\n📋 Sample German chunk:")
        print(f"Content: {german_chunks[1].page_content[:200]}...")
        
    # Compare English vs German
    print(f"\n📈 Comparison:")
    print(f"English chunks: {len(en_sample_chunks)}")
    print(f"German chunks: {len(german_chunks)}")
else:
    print("❌ No German documents to process")


🔨 Processing German text chunks...
✂️ Created 379 German text chunks
📊 German chunk sizes: [121, 471, 447]...

📋 Sample German chunk:
Content: Vorwort 
Seit dem 25. Mai 2018 gelten in allen Mitgliedstaaten der Europäischen Union neue Daten-
schutzregeln. Mit der Reform soll sichergestellt werden, dass in allen Mitgliedstaaten derselbe 
Daten...

📈 Comparison:
English chunks: 5
German chunks: 379
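
With 379 chunks from 99 pages (roughly 3.8 per page), it can be useful to see how the chunks are distributed across pages. The sketch below counts chunks per page from the `page` metadata key, which is zero-based in this notebook's PDF metadata. The `chunks_per_page` helper and the example metadata are hypothetical; in the notebook the input would be `[chunk.metadata for chunk in german_chunks]`.

```python
from collections import Counter


def chunks_per_page(chunk_metadata):
    """Count chunks per zero-based page number, skipping entries without a page."""
    return Counter(m["page"] for m in chunk_metadata if "page" in m)


# Hypothetical stand-in for [chunk.metadata for chunk in german_chunks]
example_metadata = [
    {"page": 0},
    {"page": 1},
    {"page": 1},
    {"page": 1},
    {"page": 98},
    {"source": "sample_german_handbook.txt"},  # text files carry no page key
]
counts = chunks_per_page(example_metadata)
print(counts.most_common(1))  # [(1, 3)]
```

Pages with very few chunks are often title pages or forms, which may be worth filtering out before embedding.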


In [11]:
print(german_chunks[377].metadata)
print(german_chunks[377].page_content)

{'producer': 'Microsoft® Word für Office 365', 'creator': 'Microsoft® Word für Office 365', 'creationdate': '2020-11-06T11:24:59+01:00', 'author': 'Kasper, Lisa', 'moddate': '2020-11-06T11:24:59+01:00', 'source': '../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'total_pages': 99, 'page': 98, 'page_label': '99'}
Ort, Datum, Unterschrift 
 
 
 
Die Datenverarbeitung ist für die Kontaktaufnahme per Telefon, Fax und E -Mail erforderlich und be-
ruht auf Artikel 6 Abs. 1 a) DSGVO. Eine Weitergabe der Daten an Dritte findet nicht statt. Die Daten 
werden gelöscht, sobald sie für den Zweck ihrer Verarbeitung nicht mehr erforderlich sind.  
 
Sie sind berechtigt, Auskunft der bei uns über Sie gespeicherten Daten zu beantragen sowie bei Un-
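
The extracted text keeps the PDF's end-of-line hyphenation (for example `be-` / `ruht` above), which can hurt keyword and embedding quality. A possible cleanup step is sketched below. The `join_hyphenated_line_breaks` helper is hypothetical, and the rule it uses (join only when a lowercase letter follows the break) is a heuristic assumption: it leaves compounds like `E-Mail-` / `Adresse` alone but can still mis-join constructions such as `Ein-` / `und Ausgang`.

```python
import re

# A word character, a hyphen, a line break (with optional spaces around it),
# then a lowercase letter: treated as one word split across two lines.
_HYPHEN_BREAK = re.compile(r"(\w)-[ \t]*\n[ \t]*([a-zäöüß])")


def join_hyphenated_line_breaks(text):
    """Join words split across lines, e.g. 'be-' + 'ruht' becomes 'beruht'."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


sample = "erforderlich und be-\nruht auf Artikel 6. Kontakt per E-Mail-\nAdresse."
print(join_hyphenated_line_breaks(sample))
```

Applying this before chunking, rather than after, avoids a split landing between the two halves of a word.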


## 💾 Save Results

*Save processed chunks for the next notebook*

**What we're saving**:
- English text chunks with metadata
- German text chunks with metadata  
- Ready for embedding generation

**Next Steps**:
- Vector database setup in Notebook 2
- Multilingual embedding generation
- Cross-language search testing

In [13]:
# Cell 8: Save All Chunks for Next Notebook

# Combine or save separately based on your needs
all_chunks = {
    'english': en_sample_chunks,
    'german': german_chunks if 'german_chunks' in locals() else []
}

os.makedirs("../2_data/processed", exist_ok=True)
with open("../2_data/processed/text_chunks.pkl", "wb") as f:
    pickle.dump(all_chunks, f)

print("✅ All text chunks saved for next notebook!")
print(f"📁 English chunks: {len(en_sample_chunks)}")
if 'german_chunks' in locals():
    print(f"📁 German chunks: {len(german_chunks)}")

print("\n🎉 Ready for Notebook 2: Vector Database Setup!")
print("➡️ Next: We'll create embeddings and set up semantic search for both languages")

✅ All text chunks saved for next notebook!
📁 English chunks: 5
📁 German chunks: 379

🎉 Ready for Notebook 2: Vector Database Setup!
➡️ Next: We'll create embeddings and set up semantic search for both languages
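
For the next notebook, reading the file back could look roughly like the sketch below. The `load_chunks` helper is hypothetical, and the round trip uses plain strings in a temporary directory as stand-ins for the saved chunks. Two caveats apply: unpickling real LangChain `Document` objects requires the same LangChain packages to be importable in the loading environment, and `pickle.load` should only be used on files you created yourself.

```python
import os
import pickle
import tempfile

REQUIRED_KEYS = {"english", "german"}


def load_chunks(path):
    """Load the pickled chunk dictionary and check both languages are present."""
    with open(path, "rb") as f:
        data = pickle.load(f)  # only unpickle trusted files
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise KeyError(f"Missing languages in chunk file: {sorted(missing)}")
    return data


# Round trip with plain strings standing in for Document objects
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "text_chunks.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump({"english": ["chunk a", "chunk b"], "german": ["Abschnitt a"]}, f)
    loaded = load_chunks(demo_path)

print({language: len(chunks) for language, chunks in loaded.items()})
# {'english': 2, 'german': 1}
```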


-----