# 🗄️ GDPR Compliance Agent - Notebook 2: Pinecone Vector Database Setup

## 📋 Table of Contents
1. [Overview](#overview)
2. [Setup & Environment](#setup--environment)
3. [Load Processed Data](#load-processed-data)
4. [Initialize OpenAI Embeddings](#initialize-openai-embeddings)
5. [Pinecone Setup](#pinecone-setup)
6. [Upload to Vector Database](#upload-to-vector-database)
7. [Verification & Testing](#verification--testing)
8. [Save Vector Store Reference](#save-vector-store-reference)
9. [Next Steps](#next-steps)

---

## 🎯 Overview

**Goal**: Embed our processed GDPR document chunks with OpenAI and upload them to a Pinecone vector database.

**This Notebook Focus**:
- Load chunks processed in Notebook 1
- Initialize OpenAI embeddings
- Connect to Pinecone vector database
- Upload documents with embeddings
- Verify the setup works

**Key Technologies**:
- OpenAI `text-embedding-3-small` for embeddings
- Pinecone for vector storage and search
- LangChain for orchestration



---

## ⚙️ Setup & Environment

*Import required libraries and set up environment variables*

In [1]:
# Cell 1: Setup and Imports
import os
import sys
import pickle
import time
from dotenv import load_dotenv

# Add project root to Python path
sys.path.append(os.path.abspath('..'))

# Load environment variables
load_dotenv()

# LangChain imports
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Pinecone imports
import pinecone
from pinecone import ServerlessSpec

# Helper functions
from src.embedding_cost_calculator import calculate_embedding_cost

# print("✅ All imports completed successfully!")
# print("🔑 Environment Check:")
# print(f"   OpenAI API Key: {'✅' if os.getenv('OPENAI_API_KEY') else '❌'}")
# print(f"   Pinecone API Key: {'✅' if os.getenv('PINECONE_API_KEY') else '❌'}")

# if not all([os.getenv('OPENAI_API_KEY'), os.getenv('PINECONE_API_KEY')]):
#     print("\n❌ Missing API keys. Please check your .env file")


For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet: `from pydantic.v1 import BaseModel`

  from langchain_pinecone.vectorstores import Pinecone, PineconeVectorStore


In [2]:
# # Cell 1: Setup and Imports (old)
# import os 
# import sys
# import pickle
# from dotenv import load_dotenv

# # Add project root to Python path
# sys.path.append(os.path.abspath('..'))

# # LangChain imports
# from langchain.embeddings.openai import OpenAIEmbeddings
# from langchain.vectorstores import Pinecone
# from langchain.schema import Document
# from langchain_pinecone import PineconeVectorStore

# # Pinecone import
# from pinecone import Pinecone
# from pinecone import ServerlessSpec




# # Import time for the wait functionality
# import time

# # Helper functions
# from src.embedding_cost_calculator import calculate_embedding_cost

# print("✅ Libraries imported successfully!")

In [3]:
# Cell 2: Load Environment Variables
# Load API keys from .env file
load_dotenv()

# Verify environment variables
print("🔑 Environment Configuration:")
print(f"   OpenAI API Key: {'✅' if os.getenv('OPENAI_API_KEY') else '❌'}")
print(f"   Pinecone API Key: {'✅' if os.getenv('PINECONE_API_KEY') else '❌'}")
# print(f"   Pinecone Environment: {'✅' if os.getenv('PINECONE_ENVIRONMENT') else '❌'}")

# if not all([os.getenv('OPENAI_API_KEY'), os.getenv('PINECONE_API_KEY'), os.getenv('PINECONE_ENVIRONMENT')]):
if not all([os.getenv('OPENAI_API_KEY'), os.getenv('PINECONE_API_KEY')]):
    print("\n⚠️  Missing environment variables!")
    print("   Please check your .env file contains:")
    print("   - OPENAI_API_KEY")
    print("   - PINECONE_API_KEY")
    # print("   - PINECONE_ENVIRONMENT")
    print("\n   These are required for this notebook.")

🔑 Environment Configuration:
   OpenAI API Key: ✅
   Pinecone API Key: ✅
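
The check above only prints a warning, so later cells would still run and fail further down if a key is missing. A minimal sketch of a fail-fast variant follows; `require_env` is a hypothetical helper name, not part of the original notebook.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the requested environment variables, raising if any is unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

In this notebook it could be called as `require_env("OPENAI_API_KEY", "PINECONE_API_KEY")` right after `load_dotenv()`, so the run stops at the first cell instead of at the upload step.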


---

## 📥 Load Processed Data

*Load the chunks and configuration saved from Notebook 1*


In [4]:
# Cell 3: Load Processed Chunks
def load_processed_data():
    """Load chunks and config from Notebook 1 processing"""
    try:
        # Load configuration
        with open("../2_data/processed/config.pkl", "rb") as f:
            config = pickle.load(f)
        
        # Load chunks
        with open("../2_data/processed/chunks.pkl", "rb") as f:
            serializable_chunks = pickle.load(f)
        
        # Recreate Document objects
        chunks = []
        for chunk_data in serializable_chunks:
            doc = Document(
                page_content=chunk_data['page_content'],
                metadata=chunk_data['metadata']
            )
            chunks.append(doc)
        
        print("✅ Successfully loaded processed data:")
        print(f"   - Chunks: {len(chunks)}")
        print(f"   - Chunk size: {config['chunk_size']}")
        print(f"   - Overlap: {config['chunk_overlap']}")
        print(f"   - Estimated tokens: {config['total_tokens']:,}")
        print(f"   - Estimated cost: ${config['estimated_cost']:.4f}")
        
        return chunks, config
        
    except FileNotFoundError as e:
        print(f"❌ Error: {e}")
        print("💡 Please run Notebook 1 first to process the PDF")
        return [], {}
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        return [], {}

# Load the data
chunks, config = load_processed_data()

if not chunks:
    print("❌ Cannot proceed without processed chunks.")
else:
    # Show sample chunk
    print(f"\n📋 Sample chunk metadata:")
    print(chunks[0].metadata)
    print(f"\n📝 Sample content preview:")
    print(chunks[0].page_content[:200] + "...")

✅ Successfully loaded processed data:
   - Chunks: 266
   - Chunk size: 800
   - Overlap: 120
   - Estimated tokens: 46,778
   - Estimated cost: $0.0009

📋 Sample chunk metadata:
{'document_type': 'zdh_gdpr_handbook', 'document_name': 'ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'language': 'german', 'source': '../2_data/raw/ZDH_LEITFADEN_DATENSCHUTZ_BETRIEBE_HANDWERKER.pdf', 'page_number': 1, 'total_pages': 99, 'content_length': 121, 'content_category': 'legal_basis', 'section_type': 'content', 'creationdate': '2020-11-06T11:24:59+01:00', 'author': 'Kasper, Lisa', 'moddate': '2020-11-06T11:24:59+01:00', 'page': 0, 'page_label': '1', 'chunk_id': 1, 'chunk_size': 121, 'total_chunks': 266}

📝 Sample content preview:
Leitfaden 
Datenschutzrecht 
Was Betriebe zu beachten haben 
 
 
Stand: November 2020 
 
Abteilung Organisation und Recht...
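
`load_processed_data` expects `chunks.pkl` to hold a list of plain dictionaries with `page_content` and `metadata` keys. The sketch below shows that round trip in isolation. The helper names and the temporary file are assumptions for illustration, and plain dicts stand in for LangChain `Document` objects so the example runs without extra dependencies.

```python
import pickle
import tempfile
from pathlib import Path


def save_chunks(records: list[dict], path: Path) -> None:
    """Pickle a list of {'page_content', 'metadata'} records."""
    with path.open("wb") as f:
        pickle.dump(records, f)


def load_chunks(path: Path) -> list[dict]:
    """Load the records and check that each one has the expected keys."""
    with path.open("rb") as f:
        records = pickle.load(f)
    for i, record in enumerate(records):
        missing = {"page_content", "metadata"} - record.keys()
        if missing:
            raise ValueError(f"Chunk {i} is missing keys: {sorted(missing)}")
    return records


# Demo data only; the real file is written by Notebook 1.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "chunks.pkl"
    save_chunks(
        [{"page_content": "Leitfaden Datenschutzrecht", "metadata": {"chunk_id": 1}}],
        demo_path,
    )
    restored = load_chunks(demo_path)
    print(restored[0]["metadata"])
```

Because `pickle.load` can execute arbitrary code, this format is only appropriate for files you produced yourself.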



---

## 🤖 Initialize OpenAI Embeddings

*Set up the embedding model that will convert text to vectors*


In [5]:
# Cell 4: Initialize OpenAI Embeddings
def initialize_embeddings():
    """Initialize OpenAI embeddings without test query"""
    try:
        embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            openai_api_key=os.getenv('OPENAI_API_KEY')
        )
        
        print("✅ OpenAI Embeddings initialized successfully!")
        print(f"   Model: text-embedding-3-small")
        print(f"   Dimension: 1536 (for text-embedding-3-small)")
        
        return embeddings
        
    except Exception as e:
        print(f"❌ Error initializing OpenAI embeddings: {e}")
        print("💡 Check your OPENAI_API_KEY in .env file")
        return None

# Initialize embeddings
embeddings = initialize_embeddings()

✅ OpenAI Embeddings initialized successfully!
   Model: text-embedding-3-small
   Dimension: 1536 (for text-embedding-3-small)
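
The cost estimate printed when the chunks were loaded (46,778 tokens, $0.0009) can be reproduced by hand. The sketch below assumes a price of $0.02 per million tokens for `text-embedding-3-small`; that figure is an assumption and should be checked against current OpenAI pricing. `estimate_embedding_cost` is a hypothetical name and is not the notebook's `calculate_embedding_cost` helper, whose signature is not shown here.

```python
# Assumed list price in USD per 1M tokens; verify before relying on it.
PRICE_PER_MILLION_TOKENS_USD = 0.02


def estimate_embedding_cost(
    total_tokens: int,
    price_per_million: float = PRICE_PER_MILLION_TOKENS_USD,
) -> float:
    """Return the estimated embedding cost in USD for a given token count."""
    return total_tokens / 1_000_000 * price_per_million


print(f"${estimate_embedding_cost(46_778):.4f}")  # $0.0009
```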


---

## 🗃️ Pinecone Setup

*Initialize connection to Pinecone vector database*

In [6]:
# Cell 5: Pinecone Setup (Optimized for PineconeVectorStore)
def initialize_pinecone_for_vectorstore(index_name="gdpr-compliance-openai"):
    """Initialize Pinecone specifically for PineconeVectorStore compatibility"""
    try:
        from pinecone import Pinecone, ServerlessSpec
        
        pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))
        
        print("✅ Pinecone initialized for PineconeVectorStore!")
        
        existing_indexes = pc.list_indexes().names()
        print(f"📋 Existing indexes: {existing_indexes}")
        
        if index_name not in existing_indexes:
            print(f"📦 Creating new index: {index_name}")
            
            pc.create_index(
                name=index_name,
                dimension=1536,
                metric='cosine',
                spec=ServerlessSpec(
                    cloud='aws',
                    region='us-east-1'
                )
            )
            
            print("⏳ Waiting for index to initialize...")
            while not pc.describe_index(index_name).status['ready']:
                time.sleep(1)
            
            print("✅ Index created and ready!")
        else:
            print(f"✅ Using existing index: {index_name}")
        
        # Return both pc and index for flexibility
        index = pc.Index(index_name)
        time.sleep(1)
        
        stats = index.describe_index_stats()
        print(f"📊 Index statistics:")
        print(f"   - Total vectors: {stats['total_vector_count']}")
        
        return pc, index
        
    except Exception as e:
        print(f"❌ Error initializing Pinecone: {e}")
        return None, None

# Initialize Pinecone
pc, index = initialize_pinecone_for_vectorstore()

✅ Pinecone initialized for PineconeVectorStore!
📋 Existing indexes: ['extractive-question-answering', 'gdpr-compliance-openai', 'langchain-retrieval-agent', 'abstractive-qa-history', 'gdpr-compliance']
✅ Using existing index: gdpr-compliance-openai
📊 Index statistics:
   - Total vectors: 0
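
The readiness loop in Cell 5 polls once per second with no upper bound, so it would wait forever if the index never became ready. A minimal sketch of a bounded wait is shown below; `wait_until` and its parameters are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> None:
    """Poll is_ready() until it returns True, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout_s} seconds")
        time.sleep(interval_s)
```

Inside `initialize_pinecone_for_vectorstore`, the loop could then become `wait_until(lambda: pc.describe_index(index_name).status['ready'])`, reusing the same readiness check as the original code.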




---

## 🚀 Upload to Vector Database

*Upload document chunks to Pinecone with embeddings*



In [9]:
# Cell 6: Current Vector Store Setup

def create_pinecone_vectorstore_simple(chunks, index_name="gdpr-compliance-openai"):
    """Simple version without cost calculation"""
    
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    print("🔄 Creating Pinecone vector store...")
    vectorstore = PineconeVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        index_name=index_name
    )
    
    print(f"✅ Successfully loaded {len(chunks)} documents into Pinecone")
    
    # Optional: Show basic stats
    total_chars = sum(len(chunk.page_content) for chunk in chunks)
    print(f"📊 Stats: {len(chunks)} chunks, ~{total_chars} characters")
    
    return vectorstore

# Use the simple version
vectorstore = create_pinecone_vectorstore_simple(chunks)


🔄 Creating Pinecone vector store...
✅ Successfully loaded 266 documents into Pinecone
📊 Stats: 266 chunks, ~170805 characters
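
Running the upload cell a second time embeds and uploads every chunk again. If the vector IDs are generated randomly on each run, which I believe is the default when none are supplied, the index would end up with duplicate vectors. One way to make re-runs overwrite instead is to derive each ID from the chunk itself. The sketch below shows only the ID and batching logic; `stable_chunk_id` and `batched` are hypothetical helpers, and the field names mirror the chunk metadata shown earlier.

```python
import hashlib
from typing import Iterator


def stable_chunk_id(document_name: str, chunk_id: int, page_content: str) -> str:
    """Build an ID that stays the same as long as the chunk text is unchanged."""
    digest = hashlib.sha256(page_content.encode("utf-8")).hexdigest()[:12]
    return f"{document_name}-{chunk_id}-{digest}"


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


print(stable_chunk_id("handbook.pdf", 1, "Leitfaden Datenschutzrecht"))
print(list(batched([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

LangChain vector stores generally accept an `ids` argument on `add_texts`; check the installed `langchain-pinecone` version before relying on it for this index.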



---

## ✅ Verification & Testing

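*Verify that the upload worked and test search functionality.* Each similarity search sends the query text to OpenAI to embed it, so it uses a small number of embedding tokens; no chat-model tokens are consumed.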


In [14]:
# Test Vector Store Retrieval
print("🧪 Testing Vector Store Retrieval...")

# Test with German sample queries that match the data protection content
test_queries = [
    "Was ist die Datenschutzrichtlinie?",
    "Wie sollen Kundendaten behandelt werden?",
    "Was sind die GDPR-Anforderungen?",
    "Wie geht man mit personenbezogenen Daten um?",
    "Was muss bei Datenverarbeitung beachtet werden?",
    "Welche Rechte haben Kunden bezüglich ihrer Daten?"
]


for query in test_queries:
    print(f"\n🔍 Query: '{query}'")
    results = vectorstore.similarity_search(query, k=2)
    
    print(f"   Found {len(results)} relevant chunks:")
    for i, doc in enumerate(results):
        print(f"   {i+1}. {doc.page_content[:150]}...")
    print("   " + "─" * 50)

print("\n✅ Vector store test completed!")

🧪 Testing Vector Store Retrieval...

🔍 Query: 'Was ist die Datenschutzrichtlinie?'
   Found 2 relevant chunks:
   1. Leitfaden 
Datenschutzrecht 
Was Betriebe zu beachten haben 
 
 
Stand: November 2020 
 
Abteilung Organisation und Recht...
   2. 3. Formelle Pflichten von Betrieben – Ein Überblick  
 
Welchen Zweck verfolgen die Pflichten?  
 
Das Datenschutzrecht räumt Personen, deren Daten vo...
   ──────────────────────────────────────────────────

🔍 Query: 'Wie sollen Kundendaten behandelt werden?'
   Found 2 relevant chunks:
   1. Kunde 
 
 
Familienname 
 
 
Vorname 
 
 
Geburtsname 
 
 
Geschlecht 
 
 
Geburtsdatum 
 
 
Staatsangehörigkeit 
 
 
Straße 
 
 
PLZ 
 
 
Wohnort 
 
...
   2. Gesetz vorgesehene Informationen zu erteilen. Dies sind im Einzelnen: 
  
◼ Alle über den Betroffenen gespeicherten Daten (z.B. Name, Anschrift, E -Ma...
   ──────────────────────────────────────────────────

🔍 Query: 'Was sind die GDPR-Anforderungen?'
   Found 2 relevant chunks:
   1. Rechtliche

In [10]:

# Cell 7: Verify Upload and Test Search
def verify_pinecone_upload(index_name="gdpr-compliance-openai"):
    """Verify the upload was successful by checking the index vector count"""
    try:
        # Get updated index stats through the client created in Cell 5.
        # The module-level call pinecone.Index(...) failed with
        # "module 'pinecone' has no attribute 'Index'".
        index = pc.Index(index_name)
        stats = index.describe_index_stats()
        
        print("📊 Final Vector Database Status:")
        print(f"   Total vectors: {stats['total_vector_count']}")
        print(f"   Dimension: {stats['dimension']}")
        
        expected_vectors = len(chunks)
        actual_vectors = stats['total_vector_count']
        
        if actual_vectors >= expected_vectors:
            print(f"   ✅ Upload successful: {actual_vectors} vectors stored")
        else:
            print(f"   ⚠️  Partial upload: {actual_vectors}/{expected_vectors} vectors")
        
        return True
        
    except Exception as e:
        print(f"❌ Error verifying upload: {e}")
        return False


In [11]:

def test_search_functionality(vectorstore):
    """Test that search is working with sample queries"""
    if not vectorstore:
        print("❌ No vectorstore available for testing")
        return
    
    print("\n🔍 Testing Search Functionality:")
    
    # Test queries in German (matching our document language)
    test_queries = [
        "Datenschutz Grundverordnung",
        "Kundendaten Aufbewahrung", 
        "Mitarbeiter Daten",
        "Einwilligung Marketing",
        "Datenpanne Meldefrist"
    ]
    
    for query in test_queries:
        print(f"\n   Query: '{query}'")
        
        try:
            # Search for similar documents
            results = vectorstore.similarity_search(query, k=2)
            
            print(f"      Found {len(results)} relevant documents:")
            
            for i, result in enumerate(results):
                category = result.metadata.get('content_category', 'N/A')
                page = result.metadata.get('page_number', 'N/A')
                print(f"      {i+1}. Category: {category}, Page: {page}")
                print(f"         Preview: {result.page_content[:80]}...")
                
        except Exception as e:
            print(f"      ❌ Search error: {e}")


In [12]:

# Run verification and tests
if vectorstore:
    verification_success = verify_pinecone_upload()
    if verification_success:
        test_search_functionality(vectorstore)
else:
    print("❌ Cannot verify - no vectorstore created")




---

## 💾 Save Vector Store Reference

*Save the vector store reference for use in Notebook 3*


In [None]:
# Cell 8: Save Configuration for Next Notebook
def save_vectorstore_config(vectorstore, chunks):
    """Save configuration for the next notebook"""
    try:
        config = {
            "index_name": "gdpr-compliance-openai",
            "embedding_model": "text-embedding-3-small", 
            "total_chunks": len(chunks),
            "vectorstore_ready": vectorstore is not None
        }
        
        # Save next to the other processed artifacts from Notebook 1
        with open("../2_data/processed/vectorstore_config.pkl", "wb") as f:
            pickle.dump(config, f)
        
        print("💾 Saved vectorstore configuration for Notebook 3")
        print("   File: ../2_data/processed/vectorstore_config.pkl")
        
        return config
        
    except Exception as e:
        print(f"❌ Error saving configuration: {e}")
        return {}

# Save configuration
if vectorstore:
    final_config = save_vectorstore_config(vectorstore, chunks)



---

## 🎉 Next Steps

*Summary and preparation for Notebook 3*

```python
# Cell 9: Completion Summary
print("\n" + "="*60)
print("🎉 NOTEBOOK 2 COMPLETE!")
print("="*60)

if vectorstore:
    print("✅ SUCCESS: Vector database is ready!")
    print(f"   - {len(chunks)} document chunks uploaded")
    print("   - Embeddings created with text-embedding-3-small")
    print("   - Pinecone index: gdpr-compliance-openai")
    print("   - Search functionality verified")
    
    print("\n➡️  NEXT: Notebook 3 - RAG Agent")
    print("   - Create question-answering system")
    print("   - Connect GPT model to vector database")
    print("   - Build complete RAG pipeline")
    
else:
    print("❌ INCOMPLETE: Issues with vector database setup")
    print("   Please check:")
    print("   - API keys in .env file")
    print("   - Pinecone index exists")
    print("   - Internet connection for API calls")

print("\n📁 Files created for next notebook:")
print("   - ../2_data/processed/vectorstore_config.pkl")
print("="*60)
```

## 🚀 Ready for Notebook 3!

Your vector database is now populated and ready for the RAG agent in Notebook 3!
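
On the Notebook 3 side, the saved configuration can be read back and checked before connecting to the index. The following is a sketch under assumptions: `load_vectorstore_config` is a hypothetical helper, the key names mirror the dictionary built in Cell 8, and the demo writes a temporary file rather than touching the real one.

```python
import pickle
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"index_name", "embedding_model", "total_chunks", "vectorstore_ready"}


def load_vectorstore_config(path: Path) -> dict:
    """Load the pickled config and fail early if it is incomplete or not ready."""
    with path.open("rb") as f:
        config = pickle.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"Config is missing keys: {sorted(missing)}")
    if not config["vectorstore_ready"]:
        raise RuntimeError("Vector store was not ready when the config was saved")
    return config


# Demo values only; the real file is written by Cell 8.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "vectorstore_config.pkl"
    with demo_path.open("wb") as f:
        pickle.dump(
            {
                "index_name": "gdpr-compliance-openai",
                "embedding_model": "text-embedding-3-small",
                "total_chunks": 266,
                "vectorstore_ready": True,
            },
            f,
        )
    loaded = load_vectorstore_config(demo_path)
    print(loaded["index_name"])
```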

---
# Draft

## 🔢 Embeddings Setup

*Initialize the embedding model that converts text to numbers*

**1st Model Choice**: [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Good balance of speed and quality
- 384-dimensional embeddings
- Well-tested for retrieval tasks

**Note**: By default, this model truncates any input text longer than 256 word pieces.

**How embeddings work**:
- Similar texts have similar vectors
- The closer two vectors are (for example, the higher their cosine similarity), the more similar the texts are in meaning
- Enables meaning-based search
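
The points above can be made concrete with cosine similarity, the same metric the Pinecone index in this notebook is configured with. This is a minimal sketch in plain Python; the three-dimensional vectors are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: "close" points in nearly the same direction as "query".
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]
far = [-1.0, 1.0, 0.0]

print(cosine_similarity(query, close) > cosine_similarity(query, far))  # True
```

A vector store does the same comparison at scale: it embeds the query, then returns the stored chunks whose vectors score highest against it.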

In [7]:
# Cell 2: Load Text Chunks from Previous Notebook
try:
    with open("../2_data/processed/text_chunks.pkl", "rb") as f:
        chunks = pickle.load(f)
    print(f"✅ Loaded {len(chunks)} text chunks")
except FileNotFoundError:
    print("❌ Please run the PDF processing notebook first!")
    chunks = []

✅ Loaded 2 text chunks


In [9]:
# DEBUG: Check the type of the loaded object (the container, not its first element)
print(f"Type of loaded object: {type(chunks)}")

Type of loaded object: <class 'dict'>


In [10]:
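# FIX: Convert strings to Document objects if needed
# The pickle holds a dict (see the debug output above), so chunks[0] raised
# KeyError: 0. Inspect the keys before indexing; which key holds the chunk
# texts depends on how the PDF processing notebook saved the file.
if isinstance(chunks, dict):
    print(f"Loaded a dict with keys: {list(chunks.keys())}")
elif chunks and isinstance(chunks[0], str):
    print("🔄 Converting strings to Document objects...")
    chunks = [Document(page_content=chunk) for chunk in chunks]
    print("✅ Conversion completed!")

if isinstance(chunks, list) and chunks:
    print(f"First chunk preview: {chunks[0].page_content[:100]}...")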

In [3]:
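# Cell 3: Initialize Embeddings
# Requires the langchain-huggingface package
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)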

print("✅ Embeddings model loaded!")

# Test the embeddings
sample_text = "GDPR compliance for small businesses"
sample_embedding = embeddings.embed_query(sample_text)
print(f"📐 Embedding dimension: {len(sample_embedding)}")

✅ Embeddings model loaded!
📐 Embedding dimension: 384


In [4]:
# Cell 4: Create Vector Database
# Requires the langchain-chroma package
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="../2_data/processed/chroma_db"
)

print("✅ Vector database created and persisted!")

AttributeError: 'str' object has no attribute 'page_content'

In [None]:
# Cell 5: Test Similarity Search
print("🔍 Testing similarity search...")

test_queries = [
    "data retention periods",
    "customer consent for marketing",
    "employee record keeping"
]

for query in test_queries:
    print(f"\nQuery: '{query}'")
    results = vectorstore.similarity_search(query, k=2)
    
    for i, result in enumerate(results):
        print(f"Result {i+1}: {result.page_content[:100]}...")

In [None]:
# Cell 6: Verify Database Persistence
# Reload from the same directory used in Cell 4 to verify persistence
vectorstore_reloaded = Chroma(
    persist_directory="../2_data/processed/chroma_db",
    embedding_function=embeddings
)

print("✅ Vector database reloaded successfully!")
print(f"📊 Collection count: {vectorstore_reloaded._collection.count()}")

In [None]:
# Cell 7: Advanced Search Tests
print("\n🎯 Testing different search types:")

# Plain similarity search; no metadata filter is applied because these chunks carry no metadata
results = vectorstore_reloaded.similarity_search(
    "data breach procedures", 
    k=3
)

print(f"Found {len(results)} relevant documents for 'data breach procedures'")

In [None]:
# Cell 8: Prepare for Next Notebook
print("\n✅ Vector database ready for RAG agent!")
print("Next: Create the question-answering system")