# 🗄️ GDPR Compliance Agent - Notebook 2: Vector Database

## 📋 Table of Contents
1. [Overview](#overview)
2. [Load Previous Work](#load-previous-work)
3. [Embeddings Setup](#embeddings-setup)
4. [Vector Database Creation](#vector-database-creation)
5. [Similarity Search Testing](#similarity-search-testing)
6. [Persistence Verification](#persistence-verification)
7. [Advanced Search Features](#advanced-search-features)
8. [Preparation for Next Step](#preparation-for-next-step)

---

## 🎯 Overview

**Goal**: Create a searchable knowledge base from our text chunks

**This Notebook Focus**: 
- Generate embeddings for text chunks
- Store in vector database (ChromaDB)
- Test retrieval capabilities

**Key Concepts**:
- **Embeddings**: Numerical representations of text
- **Vector Database**: Specialized storage for embeddings
- **Similarity Search**: Find relevant documents based on meaning
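
To make these three concepts concrete, here is a minimal sketch of similarity search over a toy in-memory store. The 3-dimensional vectors and the `top_k` helper are illustrative assumptions, not real model output and not the ChromaDB API; a real store would index the model's 384-dimensional embeddings.

```python
import math

# Toy "vector store": texts mapped to hand-written 3-d vectors (not real embeddings)
store = {
    "data retention policy": [0.9, 0.1, 0.0],
    "marketing consent form": [0.1, 0.9, 0.1],
    "office lunch menu": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, k=2):
    """Return the k stored texts most similar to the query vector (hypothetical helper)."""
    ranked = sorted(store, key=lambda text: cosine(query_vec, store[text]), reverse=True)
    return ranked[:k]

# A query vector pointing mostly in the "retention" direction
print(top_k([0.8, 0.2, 0.0], k=2))
```

A vector database does the same ranking, but with an index so that it does not have to compare the query against every stored vector.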

---

## 📥 Load Previous Work

*Load the text chunks we created in Notebook 1*

**What we're loading**:
- Processed text chunks
- Metadata about each chunk

**Error Handling**: Check if previous steps were completed successfully
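
One way to sketch this check is a small loader that reports both a missing file and the type of object it found. `load_chunks` is a hypothetical helper, not part of Notebook 1; printing the container type is an added assumption meant to surface a dict-versus-list mismatch early.

```python
import pickle
from pathlib import Path

def load_chunks(path):
    """Load pickled chunks, returning None if the file is missing (hypothetical helper)."""
    file = Path(path)
    if not file.exists():
        print(f"❌ {file} not found. Run Notebook 1 first.")
        return None
    with file.open("rb") as f:
        data = pickle.load(f)
    # Report the container type so an unexpected structure is visible immediately
    print(f"✅ Loaded {type(data).__name__} with {len(data)} entries")
    return data
```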

In [6]:
# Cell 1: Setup and Load Previous Work
import os
import pickle
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

print("🚀 Setting up Vector Database...")

🚀 Setting up Vector Database...


## 🔢 Embeddings Setup

*Initialize the embedding model that converts text to numbers*

**1st Model Choice**: [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Good balance of speed and quality
- 384-dimensional embeddings
- Well-tested for retrieval tasks

**Note:** By default, this model truncates any input text longer than 256 word pieces.
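
Because of this limit, it can help to flag chunks that are likely to be cut off before embedding them. The sketch below uses a rough heuristic of about 1.3 word pieces per whitespace-separated word, which is an assumption and not a property of this tokenizer; an exact count would need the model's own tokenizer. `likely_truncated` and `TOKENS_PER_WORD` are hypothetical names.

```python
MAX_WORD_PIECES = 256   # truncation limit stated for all-MiniLM-L6-v2
TOKENS_PER_WORD = 1.3   # rough assumption for English text, not an exact ratio

def likely_truncated(text, limit=MAX_WORD_PIECES):
    """Estimate from the word count whether a text would exceed the limit."""
    estimated = len(text.split()) * TOKENS_PER_WORD
    return estimated > limit

short_chunk = "GDPR compliance for small businesses"
long_chunk = "personal data " * 150  # 300 words

print(likely_truncated(short_chunk))
print(likely_truncated(long_chunk))
```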

**How embeddings work**:
- Similar texts have similar vectors
- Smaller distance between vectors = higher semantic similarity
- Enables meaning-based search
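
The link between distance and similarity can be checked numerically. For vectors scaled to unit length, squared Euclidean distance and cosine similarity satisfy d² = 2(1 − cos), so a smaller distance always corresponds to a higher similarity. The vectors below are made-up 2-d examples, not model output.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])   # points in nearly the same direction as a
c = normalize([-4.0, 3.0])  # perpendicular to a

for other in (b, c):
    cos = cosine(a, other)
    d2 = squared_distance(a, other)
    print(f"cos={cos:.2f}  d^2={d2:.2f}  2*(1-cos)={2 * (1 - cos):.2f}")
```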

In [7]:
# Cell 2: Load Text Chunks from Previous Notebook
try:
    with open("../2_data/processed/text_chunks.pkl", "rb") as f:
        chunks = pickle.load(f)
    print(f"✅ Loaded {len(chunks)} text chunks")
except FileNotFoundError:
    print("❌ Please run the PDF processing notebook first!")
    chunks = []

✅ Loaded 2 text chunks


In [9]:
# DEBUG: Check what type we have
print(f"Type of chunks: {type(chunks)}")

Type of chunks: <class 'dict'>


In [10]:
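# FIX: Normalize chunks into a list of Document objects
from langchain.schema import Document

# The pickle holds a dict, so chunks[0] raises KeyError: 0.
# Assumption: the chunk list is the first dict value that is a list; adjust to the real key.
if isinstance(chunks, dict):
    print(f"Dict keys: {list(chunks.keys())}")
    chunks = next((v for v in chunks.values() if isinstance(v, list)), [])

if chunks and isinstance(chunks[0], str):
    print("🔄 Converting strings to Document objects...")
    chunks = [Document(page_content=chunk) for chunk in chunks]
    print("✅ Conversion completed!")

if chunks:
    print(f"First chunk preview: {chunks[0].page_content[:100]}...")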

In [3]:
# Cell 3: Initialize Embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

print("✅ Embeddings model loaded!")

# Test the embeddings
sample_text = "GDPR compliance for small businesses"
sample_embedding = embeddings.embed_query(sample_text)
print(f"📐 Embedding dimension: {len(sample_embedding)}")

✅ Embeddings model loaded!
📐 Embedding dimension: 384


In [4]:
# Cell 4: Create Vector Database
# from_documents needs a list of Document objects; a raw dict would pass its string keys
assert isinstance(chunks, list), "Convert chunks to a list of Documents first"
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="../2_data/processed/chroma_db"
)

print("✅ Vector database created and persisted!")

In [None]:
# Cell 5: Test Similarity Search
print("🔍 Testing similarity search...")

test_queries = [
    "data retention periods",
    "customer consent for marketing",
    "employee record keeping"
]

for query in test_queries:
    print(f"\nQuery: '{query}'")
    results = vectorstore.similarity_search(query, k=2)
    
    for i, result in enumerate(results):
        print(f"Result {i+1}: {result.page_content[:100]}...")

In [None]:
# Cell 6: Verify Database Persistence
# Let's reload to verify it works
vectorstore_reloaded = Chroma(
    persist_directory="../2_data/processed/chroma_db",
    embedding_function=embeddings
)

print("✅ Vector database reloaded successfully!")
print(f"📊 Collection count: {vectorstore_reloaded._collection.count()}")
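
Before reloading, it can be worth confirming that the persist directory exists and actually contains files, since a mistyped path may simply yield an empty collection instead of an error. The sketch below uses only the standard library; `persist_dir_status` is a hypothetical helper, not part of Chroma or LangChain.

```python
from pathlib import Path

def persist_dir_status(path):
    """Describe whether a persist directory exists and contains files (hypothetical helper)."""
    directory = Path(path)
    if not directory.is_dir():
        return "missing"
    has_files = any(p.is_file() for p in directory.rglob("*"))
    return "populated" if has_files else "empty"

# Use the same path that was passed to Chroma.from_documents
print(persist_dir_status("../2_data/processed/chroma_db"))
```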

In [None]:
# Cell 7: Advanced Search Tests
print("\n🎯 Testing different search types:")

# Plain similarity search; a metadata filter could be added via `filter=` once chunks carry metadata
results = vectorstore_reloaded.similarity_search(
    "data breach procedures", 
    k=3
)

print(f"Found {len(results)} relevant documents for 'data breach procedures'")

In [None]:
# Cell 8: Prepare for Next Notebook
print("\n✅ Vector database ready for RAG agent!")
print("Next: Create the question-answering system")