# Module 6: Vector Stores & Databases

## 🎯 Learning Objectives
By the end of this module, you will:
- Understand vector database architecture and why traditional databases aren't suitable for similarity search
- Compare different vector store options (local vs cloud, open-source vs commercial)
- Implement CRUD operations for vectors with metadata
- Design effective metadata schemas for filtering and organization
- Benchmark and optimize vector database performance

## 📚 Key Concepts

### Why Vector Databases?

Traditional databases excel at exact matches but struggle with **similarity search**:

```sql
-- Traditional SQL: Exact match
SELECT * FROM documents WHERE title = 'Machine Learning';

-- What we need: similarity search (pseudo-SQL: in standard SQL, SIMILAR TO is a regex-style pattern match, not vector similarity)
SELECT * FROM documents WHERE embedding SIMILAR TO query_embedding;
```

#### 🚫 Traditional Database Limitations
1. **No built-in similarity search**: SQL doesn't understand "semantic closeness"
2. **Inefficient for high-dimensional data**: B-tree-style indexes degrade toward a full scan once vectors have hundreds or thousands of dimensions
3. **No approximate search**: Fast nearest-neighbor lookup needs specialized index structures such as HNSW, not B-trees
4. **Poor scalability**: Linear scan becomes prohibitive with millions of vectors

#### ✅ Vector Database Advantages
1. **Approximate Nearest Neighbor (ANN)**: Fast similarity search with controllable accuracy
2. **Optimized indexing**: HNSW, IVF, LSH algorithms designed for high-dimensional vectors
3. **Metadata filtering**: Combine vector similarity with traditional filters
4. **Horizontal scaling**: Built for production workloads with millions/billions of vectors
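
To make the speed-versus-accuracy trade-off of ANN concrete, here is a minimal sketch of random-hyperplane LSH, one of the index families named above. It uses only the standard library; the data is random, and every name in it (`lsh_signature`, `buckets`, the choice of 8 hyperplanes) is an illustrative assumption rather than how any particular database implements its index.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)

rng = random.Random(0)
dim, n_vectors, n_planes = 32, 2000, 8
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_vectors)]
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

# Build the index: group vector ids by their bit signature
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_signature(v, planes), []).append(i)

# The query is a slightly noisy copy of vector 42
query = [x + rng.gauss(0, 0.1) for x in vectors[42]]

# Exact search: compare the query against every stored vector
exact_best = max(range(n_vectors), key=lambda i: cosine(query, vectors[i]))

# Approximate search: compare only against vectors in the query's bucket
candidates = buckets.get(lsh_signature(query, planes), [])
approx_best = max(candidates, key=lambda i: cosine(query, vectors[i])) if candidates else None

print(f"Exact scan compared {n_vectors} vectors, best match: {exact_best}")
print(f"LSH compared {len(candidates)} vectors, best match: {approx_best}")
```

The bucket lookup touches only a small fraction of the data, but it can miss the true neighbor when noise flips one of the signature bits. Production indexes reduce that risk with multiple hash tables (LSH) or graph traversal (HNSW), which is where the "controllable accuracy" comes from.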

### 2025 Vector Database Landscape 🏆

| Database | Query Latency | Cost | Best For |
|----------|---------------|------|---------|
| **Pinecone** | 23ms p95 | High | Enterprise, turnkey scale |
| **Qdrant** | ~30ms | Low | Complex filters, self-hosted |
| **Weaviate** | 34ms p95 | Medium | OSS flexibility, GraphQL |
| **Milvus** | Lowest | Variable | GPU acceleration |
| **Chroma** | 20ms p50 | Free | Fast prototyping |
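
The p50 and p95 figures in the table are latency percentiles: the time under which 50% and 95% of queries complete. A single timed query says little, so a benchmark usually repeats the call and reports percentiles. The sketch below shows one way to do that with the standard library; `measure_latency` and `fake_query` are hypothetical names, and `fake_query` merely stands in for a real vector search call.

```python
import statistics
import time

def measure_latency(fn, runs=200, warmup=10):
    """Call fn repeatedly and return latency percentiles in milliseconds."""
    for _ in range(warmup):
        fn()  # warm up caches before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(samples, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples)}

def fake_query():
    """Hypothetical stand-in for a vector search call."""
    sum(i * i for i in range(10_000))

stats = measure_latency(fake_query)
print({name: round(value, 3) for name, value in stats.items()})
```

Comparing p95 rather than the mean exposes tail latency, which is usually what users notice in production.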

### Database Categories

#### 🏠 Local/Embedded Options
- **Chroma**: SQLite-based, perfect for development
- **FAISS**: Meta's library, CPU/GPU optimized
- **Hnswlib**: Pure HNSW implementation, very fast

#### ☁️ Cloud/Managed Options
- **Pinecone**: Fully managed, highest performance
- **Weaviate**: Open-source with cloud hosting
- **Qdrant**: Rust-based, excellent cost/performance

#### 🗄️ Traditional DB Extensions
- **pgvector**: PostgreSQL extension
- **Redis**: In-memory vector search
- **Elasticsearch**: Dense vector search support

## 🛠️ Setup
Let's install the required packages and set up our environment.

In [None]:
# Install required packages
!pip install -q chromadb qdrant-client faiss-cpu sentence-transformers numpy pandas matplotlib seaborn
!pip install -q langchain langchain-chroma langchain-community openai python-dotenv
# Note: For Pinecone, add: pinecone (formerly published as pinecone-client)
# Note: For Weaviate, add: weaviate-client

In [None]:
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from typing import List, Dict, Any, Tuple
import json

# LangChain imports (package-specific paths; the old `langchain.*` import shims are deprecated)
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Vector database imports
import chromadb
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, Range
import faiss

# Embedding models
from sentence_transformers import SentenceTransformer

from dotenv import load_dotenv
load_dotenv()

# Set up visualization
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Setup complete!")
print(f"📅 Today's date: {datetime.now().strftime('%Y-%m-%d')}")

## 🧪 Exercise 1: Traditional Database vs Vector Database

Let's demonstrate why traditional databases struggle with similarity search.
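
Before running real embeddings, it may help to see the similarity measure on its own. The sketch below computes cosine similarity for hand-made 3-dimensional vectors. The vectors and their labels are invented purely for illustration; real embedding models produce hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the labels are illustrative only
query = [0.9, 0.1, 0.0]        # "artificial intelligence"
doc_ml = [0.8, 0.3, 0.1]       # "machine learning algorithms"
doc_weather = [0.0, 0.2, 0.9]  # "the weather today"

sim_ml = cosine_similarity(query, doc_ml)
sim_weather = cosine_similarity(query, doc_weather)
print(f"query vs ml doc:      {sim_ml:.3f}")
print(f"query vs weather doc: {sim_weather:.3f}")
```

The query shares no words with either label, yet the vector pointing in a similar direction scores much higher. That is the property keyword matching cannot provide.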

In [None]:
# Create sample documents
sample_documents = [
    "Machine learning algorithms can learn patterns from data",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing helps computers understand text",
    "Computer vision enables machines to interpret images",
    "Reinforcement learning trains agents through rewards",
    "The weather today is sunny and warm",
    "I love cooking pasta with tomato sauce",
    "Basketball is a popular sport worldwide",
    "Python is a versatile programming language",
    "Data science combines statistics and programming"
]

# Create embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode(sample_documents)

print(f"📊 Created {len(sample_documents)} documents")
print(f"🔢 Embedding dimensions: {embeddings.shape[1]}")
print(f"💾 Total embedding size: {embeddings.nbytes:,} bytes")

In [None]:
def traditional_keyword_search(query: str, documents: List[str]) -> List[Tuple[int, str]]:
    """Simulate traditional keyword-based search"""
    query_words = set(query.lower().split())
    results = []
    
    for i, doc in enumerate(documents):
        doc_words = set(doc.lower().split())
        # Simple keyword overlap score
        overlap = len(query_words.intersection(doc_words))
        if overlap > 0:
            results.append((i, doc, overlap))
    
    # Sort by overlap score
    results.sort(key=lambda x: x[2], reverse=True)
    return [(idx, doc) for idx, doc, _ in results[:3]]

def vector_similarity_search(query: str, documents: List[str], embeddings: np.ndarray) -> List[Tuple[int, str, float]]:
    """Vector-based similarity search"""
    query_embedding = embedding_model.encode([query])[0]
    
    # Calculate cosine similarity
    similarities = np.dot(embeddings, query_embedding) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    
    # Get top 3 results
    top_indices = np.argsort(similarities)[::-1][:3]
    
    return [(idx, documents[idx], similarities[idx]) for idx in top_indices]

# Test both approaches
test_queries = [
    "artificial intelligence and neural networks",
    "understanding human language",
    "programming and data analysis"
]

print("🔍 SEARCH COMPARISON")
print("=" * 60)

for query in test_queries:
    print(f"\n❓ Query: '{query}'")
    
    # Traditional search
    print("\n🗂️ Traditional Keyword Search:")
    traditional_results = traditional_keyword_search(query, sample_documents)
    if traditional_results:
        for i, (idx, doc) in enumerate(traditional_results):
            print(f"   {i+1}. {doc}")
    else:
        print("   ❌ No results found (no keyword matches)")
    
    # Vector search
    print("\n🧠 Vector Similarity Search:")
    vector_results = vector_similarity_search(query, sample_documents, embeddings)
    for i, (idx, doc, score) in enumerate(vector_results):
        print(f"   {i+1}. {doc} (similarity: {score:.3f})")
    
    print("-" * 60)

## 🏗️ Exercise 2: Vector Database Architecture Comparison

Let's set up and compare different vector databases.

In [None]:
class VectorDBBenchmark:
    """Benchmark different vector database implementations"""
    
    def __init__(self):
        self.results = {}
        
    def setup_chroma(self, documents: List[str], embeddings: np.ndarray) -> chromadb.Collection:
        """Set up Chroma vector database"""
        client = chromadb.Client()
        
        # Create or get collection
        try:
            collection = client.create_collection(
                name="test_collection",
                metadata={"hnsw:space": "cosine"}
            )
        except Exception:  # collection already exists from a previous run
            client.delete_collection("test_collection")
            collection = client.create_collection(
                name="test_collection",
                metadata={"hnsw:space": "cosine"}
            )
        
        # Add documents
        collection.add(
            embeddings=embeddings.tolist(),
            documents=documents,
            ids=[f"doc_{i}" for i in range(len(documents))],
            metadatas=[{"source": "sample", "index": i} for i in range(len(documents))]
        )
        
        return collection
    
    def setup_qdrant_memory(self, documents: List[str], embeddings: np.ndarray) -> QdrantClient:
        """Set up Qdrant in-memory database"""
        client = QdrantClient(":memory:")
        
        # Create collection
        client.create_collection(
            collection_name="test_collection",
            vectors_config=VectorParams(
                size=embeddings.shape[1],
                distance=Distance.COSINE
            )
        )
        
        # Add points
        points = [
            PointStruct(
                id=i,
                vector=embeddings[i].tolist(),
                payload={
                    "text": documents[i],
                    "source": "sample",
                    "index": i
                }
            )
            for i in range(len(documents))
        ]
        
        client.upsert(
            collection_name="test_collection",
            points=points
        )
        
        return client
    
    def setup_faiss(self, embeddings: np.ndarray) -> Tuple[faiss.IndexFlatIP, np.ndarray]:
        """Set up FAISS index"""
        # Normalize embeddings for cosine similarity
        normalized_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        
        # Create index
        index = faiss.IndexFlatIP(embeddings.shape[1])  # Inner product for cosine similarity
        index.add(normalized_embeddings.astype('float32'))
        
        return index, normalized_embeddings
    
    def benchmark_search(self, query: str, databases: Dict[str, Any], documents: List[str]) -> Dict[str, Dict]:
        """Benchmark search performance across databases"""
        query_embedding = embedding_model.encode([query])[0]
        results = {}
        
        # Chroma search
        if 'chroma' in databases:
            start_time = time.time()
            chroma_results = databases['chroma'].query(
                query_embeddings=[query_embedding.tolist()],
                n_results=3
            )
            chroma_time = time.time() - start_time
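            results['chroma'] = {
                'time': chroma_time * 1000,  # Convert to ms
                # Chroma returns cosine distance; convert to similarity to match Qdrant and FAISS
                'results': [
                    (doc, 1 - dist)
                    for doc, dist in zip(chroma_results['documents'][0], chroma_results['distances'][0])
                ]
            }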
        
        # Qdrant search
        if 'qdrant' in databases:
            start_time = time.time()
            qdrant_results = databases['qdrant'].search(
                collection_name="test_collection",
                query_vector=query_embedding.tolist(),
                limit=3
            )
            qdrant_time = time.time() - start_time
            results['qdrant'] = {
                'time': qdrant_time * 1000,
                'results': [(point.payload['text'], point.score) for point in qdrant_results]
            }
        
        # FAISS search
        if 'faiss' in databases:
            faiss_index, normalized_embeddings = databases['faiss']
            normalized_query = query_embedding / np.linalg.norm(query_embedding)
            
            start_time = time.time()
            scores, indices = faiss_index.search(
                normalized_query.reshape(1, -1).astype('float32'), 3
            )
            faiss_time = time.time() - start_time
            results['faiss'] = {
                'time': faiss_time * 1000,
                'results': [(documents[idx], score) for idx, score in zip(indices[0], scores[0])]
            }
        
        return results

# Initialize benchmark
benchmark = VectorDBBenchmark()

print("🏗️ Setting up vector databases...")

# Set up databases
databases = {}

print("   Setting up Chroma...")
databases['chroma'] = benchmark.setup_chroma(sample_documents, embeddings)

print("   Setting up Qdrant (in-memory)...")
databases['qdrant'] = benchmark.setup_qdrant_memory(sample_documents, embeddings)

print("   Setting up FAISS...")
databases['faiss'] = benchmark.setup_faiss(embeddings)

print("✅ All databases ready!")

In [None]:
# Benchmark search performance
test_query = "machine learning and AI algorithms"

print(f"🚀 VECTOR DATABASE BENCHMARK")
print(f"Query: '{test_query}'")
print("=" * 60)

benchmark_results = benchmark.benchmark_search(test_query, databases, sample_documents)

# Display results
performance_data = []

for db_name, result in benchmark_results.items():
    print(f"\n🗄️ {db_name.upper()} Results:")
    print(f"   ⏱️ Query time: {result['time']:.2f}ms")
    print("   📄 Top results:")
    
    for i, (doc, score) in enumerate(result['results']):
        print(f"      {i+1}. {doc[:50]}... (score: {score:.3f})")
    
    performance_data.append({
        'Database': db_name.title(),
        'Query Time (ms)': result['time'],
        'Top Score': result['results'][0][1] if result['results'] else 0
    })

# Create performance comparison
perf_df = pd.DataFrame(performance_data)
print(f"\n📊 Performance Summary:")
print(perf_df.to_string(index=False))

In [None]:
# Visualize performance comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Query time comparison
bars1 = ax1.bar(perf_df['Database'], perf_df['Query Time (ms)'], 
                color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
ax1.set_title('Query Time Comparison', fontsize=14, fontweight='bold')
ax1.set_ylabel('Time (milliseconds)')
ax1.set_xlabel('Vector Database')

# Add value labels on bars
for bar, value in zip(bars1, perf_df['Query Time (ms)']):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{value:.2f}ms', ha='center', va='bottom', fontweight='bold')

# Top score comparison
bars2 = ax2.bar(perf_df['Database'], perf_df['Top Score'], 
                color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
ax2.set_title('Similarity Score Comparison', fontsize=14, fontweight='bold')
ax2.set_ylabel('Similarity Score')
ax2.set_xlabel('Vector Database')

# Add value labels on bars
for bar, value in zip(bars2, perf_df['Top Score']):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
             f'{value:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("- FAISS typically offers the fastest query times for pure vector similarity")
print("- Chroma provides a good balance of speed and ease of use for development")
print("- Qdrant offers advanced filtering capabilities with competitive performance")
print("- All three index the same embeddings, so their cosine similarity scores for a query should be nearly identical")

## 📊 Exercise 3: CRUD Operations and Metadata Management

Let's implement comprehensive CRUD (Create, Read, Update, Delete) operations with metadata.
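
Metadata filtering combines two steps: narrow the candidates by their metadata, then rank what remains by vector similarity. The sketch below shows that idea in plain Python with exact-match filters only. The records, vectors, and function names (`matches`, `filtered_search`) are hypothetical, and real databases typically apply filters inside the index rather than scanning a list.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def matches(metadata, filters):
    """True when every filter key equals the stored metadata value."""
    return all(metadata.get(key) == value for key, value in filters.items())

def filtered_search(query_vec, records, filters=None, top_k=3):
    """Pre-filter on metadata, then rank the survivors by cosine similarity."""
    candidates = [r for r in records if matches(r["metadata"], filters or {})]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vec, r["vector"]),
        reverse=True,
    )
    return [r["id"] for r in ranked[:top_k]]

# Hypothetical records with 2-dimensional toy vectors
records = [
    {"id": "ml_001", "vector": [0.9, 0.1], "metadata": {"difficulty": "beginner"}},
    {"id": "dl_001", "vector": [0.8, 0.3], "metadata": {"difficulty": "intermediate"}},
    {"id": "cv_001", "vector": [0.1, 0.9], "metadata": {"difficulty": "advanced"}},
]

unfiltered = filtered_search([1.0, 0.0], records)
advanced_only = filtered_search([1.0, 0.0], records, {"difficulty": "advanced"})
print(unfiltered)     # ['ml_001', 'dl_001', 'cv_001']
print(advanced_only)  # ['cv_001']
```

With the filter applied, the least similar record is the only result: a filter restricts the candidate set and never changes the similarity scores themselves.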

In [None]:
class AdvancedVectorStore:
    """Advanced vector store with comprehensive CRUD operations"""
    
    def __init__(self, db_type="chroma"):
        self.db_type = db_type
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        
        if db_type == "chroma":
            self.client = chromadb.Client()
            try:
                self.collection = self.client.create_collection(
                    name="advanced_collection",
                    metadata={"hnsw:space": "cosine"}
                )
            except Exception:  # collection already exists from a previous run
                self.client.delete_collection("advanced_collection")
                self.collection = self.client.create_collection(
                    name="advanced_collection",
                    metadata={"hnsw:space": "cosine"}
                )
        elif db_type == "qdrant":
            self.client = QdrantClient(":memory:")
            self.collection_name = "advanced_collection"
            self.client.create_collection(
                collection_name=self.collection_name,
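                vectors_config=VectorParams(
                    # Read the dimension from the model instead of hard-coding 384
                    size=self.embedding_model.get_sentence_embedding_dimension(),
                    distance=Distance.COSINE
                )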
            )
    
    def create_document(self, doc_id: str, text: str, metadata: Dict[str, Any]) -> bool:
        """Create a new document with embedding and metadata"""
        try:
            embedding = self.embedding_model.encode([text])[0]
            
            if self.db_type == "chroma":
                self.collection.add(
                    embeddings=[embedding.tolist()],
                    documents=[text],
                    ids=[doc_id],
                    metadatas=[metadata]
                )
            elif self.db_type == "qdrant":
                # NOTE: Qdrant point IDs must be unsigned integers or UUIDs, so string IDs
                # such as "ml_001" need mapping (e.g. via uuid.uuid5) before this branch works
                point = PointStruct(
                    id=doc_id,
                    vector=embedding.tolist(),
                    payload={**metadata, "text": text}
                )
                self.client.upsert(
                    collection_name=self.collection_name,
                    points=[point]
                )
            
            return True
        except Exception as e:
            print(f"Error creating document {doc_id}: {e}")
            return False
    
    def read_document(self, doc_id: str) -> Dict[str, Any]:
        """Read a specific document by ID"""
        try:
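            if self.db_type == "chroma":
                # get() omits embeddings unless they are requested explicitly
                result = self.collection.get(
                    ids=[doc_id],
                    include=["documents", "metadatas", "embeddings"]
                )
                if result['ids']:
                    stored_embeddings = result['embeddings']
                    return {
                        'id': result['ids'][0],
                        'text': result['documents'][0],
                        'metadata': result['metadatas'][0],
                        'embedding': stored_embeddings[0] if stored_embeddings is not None else None
                    }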
            elif self.db_type == "qdrant":
                result = self.client.retrieve(
                    collection_name=self.collection_name,
                    ids=[doc_id],
                    with_payload=True,
                    with_vectors=True
                )
                if result:
                    point = result[0]
                    return {
                        'id': point.id,
                        'text': point.payload.get('text'),
                        'metadata': {k: v for k, v in point.payload.items() if k != 'text'},
                        'embedding': point.vector
                    }
            return None
        except Exception as e:
            print(f"Error reading document {doc_id}: {e}")
            return None
    
    def update_document(self, doc_id: str, text: str = None, metadata: Dict[str, Any] = None) -> bool:
        """Update an existing document's text and/or metadata"""
        try:
            current_doc = self.read_document(doc_id)
            if current_doc is None or (text is None and metadata is None):
                return False

            # Merge new metadata into the existing metadata
            updated_metadata = dict(current_doc['metadata'] or {})
            if metadata:
                updated_metadata.update(metadata)

            # Keep the existing text for metadata-only updates
            new_text = text if text is not None else current_doc['text']

            # Delete and recreate so the stored embedding matches the text
            self.delete_document(doc_id)
            return self.create_document(doc_id, new_text, updated_metadata)
        except Exception as e:
            print(f"Error updating document {doc_id}: {e}")
            return False
    
    def delete_document(self, doc_id: str) -> bool:
        """Delete a document by ID"""
        try:
            if self.db_type == "chroma":
                self.collection.delete(ids=[doc_id])
            elif self.db_type == "qdrant":
                self.client.delete(
                    collection_name=self.collection_name,
                    points_selector=[doc_id]
                )
            return True
        except Exception as e:
            print(f"Error deleting document {doc_id}: {e}")
            return False
    
    def search_with_filters(self, query: str, filters: Dict[str, Any] = None, top_k: int = 5) -> List[Dict]:
        """Search with metadata filtering"""
        try:
            query_embedding = self.embedding_model.encode([query])[0]
            
            if self.db_type == "chroma":
                where_clause = filters if filters else None
                results = self.collection.query(
                    query_embeddings=[query_embedding.tolist()],
                    n_results=top_k,
                    where=where_clause
                )
                
                return [
                    {
                        'id': results['ids'][0][i],
                        'text': results['documents'][0][i],
                        'metadata': results['metadatas'][0][i],
                        'score': 1 - results['distances'][0][i]  # Convert distance to similarity
                    }
                    for i in range(len(results['ids'][0]))
                ]
            
            elif self.db_type == "qdrant":
                # Convert filters to Qdrant format
                qdrant_filter = None
                if filters:
                    conditions = []
                    for key, value in filters.items():
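                        # bool is a subclass of int, so exclude it from the numeric range branch
                        if isinstance(value, (int, float)) and not isinstance(value, bool):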
                            conditions.append(FieldCondition(key=key, range=Range(gte=value, lte=value)))
                        else:
                            conditions.append(FieldCondition(key=key, match={"value": value}))
                    qdrant_filter = Filter(must=conditions)
                
                results = self.client.search(
                    collection_name=self.collection_name,
                    query_vector=query_embedding.tolist(),
                    limit=top_k,
                    query_filter=qdrant_filter
                )
                
                return [
                    {
                        'id': point.id,
                        'text': point.payload.get('text'),
                        'metadata': {k: v for k, v in point.payload.items() if k != 'text'},
                        'score': point.score
                    }
                    for point in results
                ]
            
            return []
        except Exception as e:
            print(f"Error in filtered search: {e}")
            return []
    
    def get_collection_stats(self) -> Dict[str, Any]:
        """Get statistics about the collection"""
        try:
            if self.db_type == "chroma":
                count = self.collection.count()
                return {
                    'total_documents': count,
                    'database_type': 'Chroma'
                }
            elif self.db_type == "qdrant":
                info = self.client.get_collection(self.collection_name)
                return {
                    'total_documents': info.points_count,
                    'database_type': 'Qdrant',
                    'vector_size': info.config.params.vectors.size,
                    'distance_metric': info.config.params.vectors.distance
                }
        except Exception as e:
            print(f"Error getting stats: {e}")
            return {}

# Initialize advanced vector store
print("🏗️ Initializing Advanced Vector Store...")
vector_store = AdvancedVectorStore(db_type="chroma")
print("✅ Advanced Vector Store ready!")

In [None]:
# Create sample documents with rich metadata
sample_documents_with_metadata = [
    {
        "id": "ml_001",
        "text": "Machine learning algorithms can automatically learn patterns from historical data without being explicitly programmed.",
        "metadata": {
            "category": "machine_learning",
            "difficulty": "beginner",
            "topic": "algorithms",
            "word_count": 14,
            "author": "AI_Expert",
            "date": "2025-01-01"
        }
    },
    {
        "id": "dl_001",
        "text": "Deep learning uses neural networks with multiple hidden layers to model complex patterns in data.",
        "metadata": {
            "category": "deep_learning",
            "difficulty": "intermediate",
            "topic": "neural_networks",
            "word_count": 15,
            "author": "DL_Researcher",
            "date": "2025-01-02"
        }
    },
    {
        "id": "nlp_001",
        "text": "Natural language processing enables computers to understand, interpret, and generate human language.",
        "metadata": {
            "category": "nlp",
            "difficulty": "intermediate",
            "topic": "language_understanding",
            "word_count": 12,
            "author": "NLP_Specialist",
            "date": "2025-01-03"
        }
    },
    {
        "id": "cv_001",
        "text": "Computer vision algorithms analyze and interpret visual information from images and videos.",
        "metadata": {
            "category": "computer_vision",
            "difficulty": "advanced",
            "topic": "image_processing",
            "word_count": 12,
            "author": "CV_Engineer",
            "date": "2025-01-04"
        }
    },
    {
        "id": "rl_001",
        "text": "Reinforcement learning trains agents to make decisions through trial and error using reward signals.",
        "metadata": {
            "category": "reinforcement_learning",
            "difficulty": "advanced",
            "topic": "decision_making",
            "word_count": 14,
            "author": "RL_Expert",
            "date": "2025-01-05"
        }
    }
]

print("📝 CRUD OPERATIONS DEMONSTRATION")
print("=" * 50)

# CREATE: Add documents
print("\n➕ CREATE Operation:")
for doc in sample_documents_with_metadata:
    success = vector_store.create_document(doc["id"], doc["text"], doc["metadata"])
    print(f"   Created {doc['id']}: {'✅' if success else '❌'}")

# READ: Retrieve specific documents
print("\n📖 READ Operation:")
retrieved_doc = vector_store.read_document("ml_001")
if retrieved_doc:
    print(f"   Retrieved: {retrieved_doc['id']}")
    print(f"   Text: {retrieved_doc['text'][:50]}...")
    print(f"   Category: {retrieved_doc['metadata']['category']}")
else:
    print("   ❌ Document not found")

# Collection stats
stats = vector_store.get_collection_stats()
print(f"\n📊 Collection Stats: {stats}")

In [None]:
# Advanced search with metadata filtering
print("\n🔍 ADVANCED SEARCH WITH FILTERING")
print("=" * 50)

# Search without filters
print("\n1. General Search (no filters):")
results = vector_store.search_with_filters("learning algorithms", top_k=3)
for i, result in enumerate(results):
    print(f"   {i+1}. {result['text'][:60]}... (score: {result['score']:.3f})")
    print(f"       Category: {result['metadata']['category']}, Difficulty: {result['metadata']['difficulty']}")

# Search with category filter
print("\n2. Filtered Search (category = 'deep_learning'):")
filtered_results = vector_store.search_with_filters(
    "neural networks",
    filters={"category": "deep_learning"},
    top_k=3
)
for i, result in enumerate(filtered_results):
    print(f"   {i+1}. {result['text'][:60]}... (score: {result['score']:.3f})")
    print(f"       Category: {result['metadata']['category']}")

# Search with difficulty filter
print("\n3. Filtered Search (difficulty = 'beginner'):")
beginner_results = vector_store.search_with_filters(
    "machine learning",
    filters={"difficulty": "beginner"},
    top_k=3
)
for i, result in enumerate(beginner_results):
    print(f"   {i+1}. {result['text'][:60]}... (score: {result['score']:.3f})")
    print(f"       Difficulty: {result['metadata']['difficulty']}")

In [None]:
# UPDATE and DELETE operations
print("\n🔄 UPDATE Operation:")
updated_text = "Machine learning algorithms use statistical methods to automatically learn complex patterns from large datasets."
update_success = vector_store.update_document(
    "ml_001", 
    text=updated_text, 
    metadata={"difficulty": "intermediate", "updated": True}
)
print(f"   Updated ml_001: {'✅' if update_success else '❌'}")

# Verify update
updated_doc = vector_store.read_document("ml_001")
if updated_doc:
    print(f"   New text: {updated_doc['text'][:50]}...")
    print(f"   New difficulty: {updated_doc['metadata']['difficulty']}")
    print(f"   Updated flag: {updated_doc['metadata'].get('updated', False)}")

print("\n❌ DELETE Operation:")
delete_success = vector_store.delete_document("cv_001")
print(f"   Deleted cv_001: {'✅' if delete_success else '❌'}")

# Verify deletion
deleted_doc = vector_store.read_document("cv_001")
print(f"   Document still exists: {'✅ No' if deleted_doc is None else '❌ Yes'}")

# Final collection stats
final_stats = vector_store.get_collection_stats()
print(f"\n📊 Final Collection Stats: {final_stats}")

## 🏎️ Exercise 4: Performance Optimization and Scaling

Let's explore performance optimization techniques for vector databases.
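
One of the simplest optimizations is batching: inserting records in groups amortizes per-call overhead instead of paying it for every document. The helper below is a minimal sketch of the slicing logic only; `iter_batches` is a hypothetical name, and the best batch size depends on the database and the payload size.

```python
from typing import Iterator, Sequence

def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = [f"doc_{i}" for i in range(250)]
sizes = [len(batch) for batch in iter_batches(docs, 100)]
print(sizes)  # [100, 100, 50]
```

The last batch is simply shorter, so callers do not need special handling for collections whose size is not a multiple of the batch size.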

In [None]:
class PerformanceOptimizer:
    """Performance optimization and benchmarking for vector databases"""
    
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.results = []
    
    def generate_synthetic_data(self, num_docs: int) -> Tuple[List[str], np.ndarray]:
        """Generate synthetic documents for performance testing"""
        topics = [
            "machine learning", "deep learning", "natural language processing",
            "computer vision", "data science", "artificial intelligence",
            "neural networks", "reinforcement learning", "robotics",
            "big data", "cloud computing", "cybersecurity"
        ]
        
        documents = []
        for i in range(num_docs):
            topic = topics[i % len(topics)]
            doc = f"This is document {i} about {topic} and its applications in modern technology. "
            doc += f"It covers various aspects of {topic} including implementation details and best practices."
            documents.append(doc)
        
        print(f"Generating embeddings for {num_docs} documents...")
        embeddings = self.embedding_model.encode(documents, show_progress_bar=True)
        
        return documents, embeddings
    
    def benchmark_insertion(self, database_configs: Dict[str, Any], documents: List[str], embeddings: np.ndarray) -> Dict[str, float]:
        """Benchmark insertion performance"""
        results = {}
        
        for db_name, config in database_configs.items():
            print(f"\nBenchmarking {db_name} insertion...")
            
            if db_name == "chroma":
                start_time = time.time()
                
                client = chromadb.Client()
                try:
                    collection = client.create_collection(
                        name=f"perf_test_{db_name}",
                        metadata={"hnsw:space": "cosine"}
                    )
                except Exception:  # collection already exists from a previous run
                    client.delete_collection(f"perf_test_{db_name}")
                    collection = client.create_collection(
                        name=f"perf_test_{db_name}",
                        metadata={"hnsw:space": "cosine"}
                    )
                
                # Batch insertion
                batch_size = config.get('batch_size', 100)
                for i in range(0, len(documents), batch_size):
                    batch_docs = documents[i:i+batch_size]
                    batch_embeddings = embeddings[i:i+batch_size]
                    batch_ids = [f"doc_{j}" for j in range(i, min(i+batch_size, len(documents)))]
                    batch_metadata = [{"index": j, "batch": i//batch_size} for j in range(i, min(i+batch_size, len(documents)))]
                    
                    collection.add(
                        embeddings=batch_embeddings.tolist(),
                        documents=batch_docs,
                        ids=batch_ids,
                        metadatas=batch_metadata
                    )
                
                insertion_time = time.time() - start_time
                results[db_name] = insertion_time
            
            elif db_name == "qdrant":
                start_time = time.time()
                
                client = QdrantClient(":memory:")
                collection_name = f"perf_test_{db_name}"
                
                client.create_collection(
                    collection_name=collection_name,
                    vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.COSINE)
                )
                
                # Batch insertion
                batch_size = config.get('batch_size', 100)
                for i in range(0, len(documents), batch_size):
                    batch_docs = documents[i:i+batch_size]
                    batch_embeddings = embeddings[i:i+batch_size]
                    
                    points = [
                        PointStruct(
                            id=i+j,
                            vector=batch_embeddings[j].tolist(),
                            payload={
                                "text": batch_docs[j],
                                "index": i+j,
                                "batch": i//batch_size
                            }
                        )
                        for j in range(len(batch_docs))
                    ]
                    
                    client.upsert(collection_name=collection_name, points=points)
                
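                insertion_time = time.time() - start_time
                results[db_name] = insertion_time
                
                # Keep the client and collection name so the query benchmark can reuse them
                if not hasattr(self, "_bench_handles"):
                    self._bench_handles = {}
                self._bench_handles[db_name] = (client, collection_name)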
            
            elif db_name == "faiss":
                start_time = time.time()
                
                # Normalize for cosine similarity
                normalized_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
                
                # Create and populate index
                index = faiss.IndexFlatIP(embeddings.shape[1])
                index.add(normalized_embeddings.astype('float32'))
                
                insertion_time = time.time() - start_time
                results[db_name] = insertion_time
        
        return results
    
    def benchmark_query_performance(self, num_queries: int = 100, num_docs: int = 1000) -> pd.DataFrame:
        """Comprehensive query performance benchmark"""
        print(f"\n🏎️ PERFORMANCE BENCHMARK")
        print(f"Documents: {num_docs}, Queries: {num_queries}")
        print("=" * 50)
        
        # Generate test data
        documents, embeddings = self.generate_synthetic_data(num_docs)
        
        # Database configurations
        configs = {
            "chroma": {"batch_size": 100},
            "qdrant": {"batch_size": 100},
            "faiss": {"batch_size": 1000}
        }
        
        # Benchmark insertion
        print("\nBenchmarking insertion performance...")
        insertion_times = self.benchmark_insertion(configs, documents, embeddings)
        
        # Generate test queries
        query_texts = [
            "machine learning algorithms", "deep neural networks", "data processing",
            "artificial intelligence", "computer vision tasks", "natural language"
        ]
        
        test_queries = [query_texts[i % len(query_texts)] for i in range(num_queries)]
        query_embeddings = self.embedding_model.encode(test_queries)
        
        # Benchmark query performance
        print("\nBenchmarking query performance...")
        
        performance_data = []
        
        # Collections kept by benchmark_insertion, if any; stores without one are skipped
        handles = getattr(self, "_bench_handles", {})
        
        # Build the FAISS index once, outside the timed region, so only search is measured
        normalized_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        faiss_index = faiss.IndexFlatIP(embeddings.shape[1])
        faiss_index.add(normalized_embeddings.astype('float32'))
        
        for db_name in configs.keys():
            print(f"   Testing {db_name}...")
            
            query_times = []
            
            # Time only the first 10 queries to keep the benchmark quick
            for i in range(min(10, num_queries)):
                query_embedding = query_embeddings[i]
                
                if db_name == "chroma" and "chroma" in handles:
                    start_time = time.perf_counter()
                    handles["chroma"].query(query_embeddings=[query_embedding.tolist()], n_results=5)
                elif db_name == "qdrant" and "qdrant" in handles:
                    client, collection_name = handles["qdrant"]
                    start_time = time.perf_counter()
                    # query_points requires qdrant-client >= 1.10; older releases expose client.search
                    client.query_points(collection_name=collection_name, query=query_embedding.tolist(), limit=5)
                elif db_name == "faiss":
                    normalized_query = (query_embedding / np.linalg.norm(query_embedding)).reshape(1, -1).astype('float32')
                    start_time = time.perf_counter()
                    faiss_index.search(normalized_query, 5)
                else:
                    # No live collection to query: skip instead of timing a no-op
                    continue
                
                query_time = time.perf_counter() - start_time
                query_times.append(query_time * 1000)  # Convert to ms
            
            # Calculate statistics (NaN when nothing was measured)
            avg_query_time = np.mean(query_times) if query_times else float('nan')
            p95_query_time = np.percentile(query_times, 95) if query_times else float('nan')
            
            performance_data.append({
                'Database': db_name.title(),
                'Documents': num_docs,
                'Insertion Time (s)': insertion_times.get(db_name, 0),
                'Docs/sec (insertion)': num_docs / insertion_times.get(db_name, 1),
                'Avg Query Time (ms)': avg_query_time,
                'P95 Query Time (ms)': p95_query_time,
                'Queries/sec': 1000 / avg_query_time if avg_query_time > 0 else float('nan')
            })
        
        return pd.DataFrame(performance_data)

# Initialize performance optimizer
optimizer = PerformanceOptimizer()
print("⚡ Performance Optimizer initialized!")

In [None]:
# Run performance benchmark
performance_results = optimizer.benchmark_query_performance(num_queries=50, num_docs=500)

print("\n📊 PERFORMANCE RESULTS")
print("=" * 70)
print(performance_results.to_string(index=False, float_format='%.2f'))

# Visualize performance results
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Insertion performance
ax1.bar(performance_results['Database'], performance_results['Docs/sec (insertion)'], 
        color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
ax1.set_title('Insertion Performance (Documents per Second)', fontweight='bold')
ax1.set_ylabel('Documents/Second')
ax1.tick_params(axis='x', rotation=45)

# Query latency
ax2.bar(performance_results['Database'], performance_results['Avg Query Time (ms)'], 
        color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
ax2.set_title('Average Query Latency', fontweight='bold')
ax2.set_ylabel('Time (milliseconds)')
ax2.tick_params(axis='x', rotation=45)

# Query throughput
ax3.bar(performance_results['Database'], performance_results['Queries/sec'], 
        color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
ax3.set_title('Query Throughput (Queries per Second)', fontweight='bold')
ax3.set_ylabel('Queries/Second')
ax3.tick_params(axis='x', rotation=45)

# P95 latency
ax4.bar(performance_results['Database'], performance_results['P95 Query Time (ms)'], 
        color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
ax4.set_title('P95 Query Latency', fontweight='bold')
ax4.set_ylabel('Time (milliseconds)')
ax4.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n💡 Performance Insights:")
print("- FAISS typically has the lowest raw search latency: it is an in-process index library, not a full database")
print("- Chroma provides good balance of features and performance for development")
print("- Qdrant excels in production scenarios with complex filtering requirements")
print("- Consider your specific use case: prototyping vs production, filtering needs, etc.")
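
Latency numbers are only meaningful when setup work (client creation, index construction) stays outside the timed region. The following is a minimal sketch of a timing helper using only the standard library. `measure_latency` and `search_fn` are illustrative names, not part of the notebook's `PerformanceOptimizer`, and the workload at the bottom is a stand-in for a real store query.

```python
import math
import statistics
import time
from typing import Callable, Dict, Sequence, TypeVar

Q = TypeVar("Q")


def measure_latency(search_fn: Callable[[Q], object],
                    queries: Sequence[Q],
                    warmup: int = 2) -> Dict[str, float]:
    """Time search_fn on each query; return latency statistics in milliseconds."""
    if not queries:
        raise ValueError("need at least one query to time")

    # Warm-up calls absorb one-off costs (lazy initialization, cache fills); not recorded
    for query in queries[:warmup]:
        search_fn(query)

    latencies_ms = []
    for query in queries:
        start = time.perf_counter()  # monotonic, high-resolution clock
        search_fn(query)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    latencies_ms.sort()
    # Nearest-rank 95th percentile: smallest sample with at least 95% of samples at or below it
    p95_index = math.ceil(0.95 * len(latencies_ms)) - 1
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
    }


# Stand-in workload for illustration: a real benchmark would call the store's search here
stats = measure_latency(lambda n: sum(i * i for i in range(n)), [10_000] * 20)
print({name: round(value, 3) for name, value in stats.items()})
```

With only a handful of samples, P95 is close to the maximum, so a larger query count gives a more stable tail estimate.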

## 🧠 Exercise 5: Metadata Schema Design Best Practices

Let's explore effective metadata schema design for different use cases.

In [None]:
class MetadataSchemaDesigner:
    """Design and validate metadata schemas for different use cases"""
    
    def __init__(self):
        self.schemas = {}
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    
    def create_schema(self, schema_name: str, schema_definition: Dict[str, Any]) -> None:
        """Create a metadata schema"""
        self.schemas[schema_name] = schema_definition
        print(f"✅ Created schema: {schema_name}")
    
    def validate_document(self, document: Dict[str, Any], schema_name: str) -> Dict[str, Any]:
        """Validate a document against a schema"""
        if schema_name not in self.schemas:
            return {"valid": False, "errors": [f"Schema {schema_name} not found"]}
        
        schema = self.schemas[schema_name]
        errors = []
        
        # Check required fields
        required_fields = schema.get("required", [])
        for field in required_fields:
            if field not in document.get("metadata", {}):
                errors.append(f"Missing required field: {field}")
        
        # Check field types
        field_types = schema.get("properties", {})
        for field, expected_type in field_types.items():
            if field in document.get("metadata", {}):
                value = document["metadata"][field]
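                if expected_type == "string" and not isinstance(value, str):
                    errors.append(f"Field {field} should be string, got {type(value).__name__}")
                # bool is a subclass of int in Python, so exclude it from the number check
                elif expected_type == "number" and (isinstance(value, bool) or not isinstance(value, (int, float))):
                    errors.append(f"Field {field} should be number, got {type(value).__name__}")
                elif expected_type == "array" and not isinstance(value, list):
                    errors.append(f"Field {field} should be array, got {type(value).__name__}")
                elif expected_type == "boolean" and not isinstance(value, bool):
                    errors.append(f"Field {field} should be boolean, got {type(value).__name__}")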
        
        return {"valid": len(errors) == 0, "errors": errors}
    
    def demonstrate_schemas(self) -> None:
        """Demonstrate different metadata schema patterns"""
        print("🏗️ METADATA SCHEMA DESIGN PATTERNS")
        print("=" * 50)
        
        # Schema 1: E-commerce Product Catalog
        ecommerce_schema = {
            "name": "E-commerce Product Catalog",
            "description": "Metadata for product descriptions and specifications",
            "required": ["product_id", "category", "price", "availability"],
            "properties": {
                "product_id": "string",
                "category": "string",
                "subcategory": "string",
                "brand": "string",
                "price": "number",
                "currency": "string",
                "availability": "string",
                "tags": "array",
                "rating": "number",
                "review_count": "number",
                "launch_date": "string",
                "is_featured": "boolean"
            },
            "filtering_strategy": {
                "primary_filters": ["category", "price", "availability", "brand"],
                "secondary_filters": ["rating", "tags", "is_featured"],
                "range_filters": ["price", "rating"]
            }
        }
        
        # Schema 2: Legal Document Management
        legal_schema = {
            "name": "Legal Document Management",
            "description": "Metadata for legal documents and contracts",
            "required": ["document_type", "jurisdiction", "date_created", "classification"],
            "properties": {
                "document_type": "string",
                "jurisdiction": "string",
                "practice_area": "string",
                "client_id": "string",
                "matter_id": "string",
                "date_created": "string",
                "date_modified": "string",
                "classification": "string",
                "confidentiality_level": "string",
                "parties": "array",
                "contract_value": "number",
                "expiry_date": "string",
                "status": "string"
            },
            "filtering_strategy": {
                "primary_filters": ["document_type", "jurisdiction", "practice_area", "classification"],
                "secondary_filters": ["client_id", "status", "confidentiality_level"],
                "date_filters": ["date_created", "date_modified", "expiry_date"]
            }
        }
        
        # Schema 3: Research Paper Database
        research_schema = {
            "name": "Research Paper Database",
            "description": "Metadata for academic papers and research articles",
            "required": ["title", "authors", "publication_date", "venue"],
            "properties": {
                "title": "string",
                "authors": "array",
                "publication_date": "string",
                "venue": "string",
                "venue_type": "string",
                "doi": "string",
                "arxiv_id": "string",
                "fields_of_study": "array",
                "keywords": "array",
                "citation_count": "number",
                "h_index": "number",
                "impact_factor": "number",
                "open_access": "boolean",
                "funding_sources": "array",
                "methodology": "string"
            },
            "filtering_strategy": {
                "primary_filters": ["fields_of_study", "venue_type", "open_access"],
                "secondary_filters": ["authors", "venue", "methodology"],
                "range_filters": ["citation_count", "impact_factor", "publication_date"]
            }
        }
        
        # Store schemas
        self.create_schema("ecommerce", ecommerce_schema)
        self.create_schema("legal", legal_schema)
        self.create_schema("research", research_schema)
        
        # Display schema summaries
        for schema_name, schema in self.schemas.items():
            print(f"\n📋 {schema['name']}:")
            print(f"   Description: {schema['description']}")
            print(f"   Required fields: {', '.join(schema['required'])}")
            print(f"   Total properties: {len(schema['properties'])}")
            print(f"   Primary filters: {', '.join(schema['filtering_strategy']['primary_filters'])}")
    
    def create_sample_documents(self) -> Dict[str, List[Dict]]:
        """Create sample documents for each schema"""
        samples = {
            "ecommerce": [
                {
                    "id": "prod_001",
                    "text": "High-performance wireless noise-canceling headphones with 30-hour battery life and premium sound quality.",
                    "metadata": {
                        "product_id": "WH-1000XM5",
                        "category": "Electronics",
                        "subcategory": "Headphones",
                        "brand": "Sony",
                        "price": 399.99,
                        "currency": "USD",
                        "availability": "in_stock",
                        "tags": ["wireless", "noise-canceling", "premium"],
                        "rating": 4.7,
                        "review_count": 1250,
                        "launch_date": "2023-05-15",
                        "is_featured": True
                    }
                },
                {
                    "id": "prod_002",
                    "text": "Ergonomic office chair with lumbar support and adjustable height for maximum comfort during long work sessions.",
                    "metadata": {
                        "product_id": "OFC-ERGO-2024",
                        "category": "Furniture",
                        "subcategory": "Office Chairs",
                        "brand": "ErgoMax",
                        "price": 249.99,
                        "currency": "USD",
                        "availability": "limited_stock",
                        "tags": ["ergonomic", "office", "adjustable"],
                        "rating": 4.3,
                        "review_count": 875,
                        "launch_date": "2024-01-10",
                        "is_featured": False
                    }
                }
            ],
            "legal": [
                {
                    "id": "legal_001",
                    "text": "Software licensing agreement between TechCorp and DataSoft for enterprise analytics platform usage.",
                    "metadata": {
                        "document_type": "License Agreement",
                        "jurisdiction": "California",
                        "practice_area": "Technology Law",
                        "client_id": "CLIENT_001",
                        "matter_id": "MAT_2024_001",
                        "date_created": "2024-01-15",
                        "date_modified": "2024-01-20",
                        "classification": "Commercial",
                        "confidentiality_level": "Confidential",
                        "parties": ["TechCorp Inc.", "DataSoft LLC"],
                        "contract_value": 150000.00,
                        "expiry_date": "2026-01-15",
                        "status": "Executed"
                    }
                }
            ],
            "research": [
                {
                    "id": "paper_001",
                    "text": "A comprehensive study on transformer architectures for natural language processing tasks, comparing performance across multiple benchmarks.",
                    "metadata": {
                        "title": "Transformer Architectures for NLP: A Comprehensive Analysis",
                        "authors": ["Dr. Jane Smith", "Prof. John Doe", "Dr. Alice Johnson"],
                        "publication_date": "2024-03-15",
                        "venue": "Journal of Machine Learning Research",
                        "venue_type": "Journal",
                        "doi": "10.1234/jmlr.2024.001",
                        "arxiv_id": "2403.001234",
                        "fields_of_study": ["Machine Learning", "Natural Language Processing"],
                        "keywords": ["transformers", "attention", "NLP", "benchmarks"],
                        "citation_count": 45,
                        "h_index": 12,
                        "impact_factor": 3.8,
                        "open_access": True,
                        "funding_sources": ["NSF", "Google Research"],
                        "methodology": "Experimental"
                    }
                }
            ]
        }
        
        return samples

# Initialize metadata schema designer
schema_designer = MetadataSchemaDesigner()
schema_designer.demonstrate_schemas()

print("\n🔧 Metadata Schema Designer ready!")

In [None]:
# Test schema validation
print("\n🔍 SCHEMA VALIDATION TESTING")
print("=" * 50)

sample_documents = schema_designer.create_sample_documents()

# Test each document against its schema
for schema_name, documents in sample_documents.items():
    print(f"\n📋 Testing {schema_name.title()} Schema:")
    
    for doc in documents:
        validation_result = schema_designer.validate_document(doc, schema_name)
        
        print(f"   Document {doc['id']}: {'✅ Valid' if validation_result['valid'] else '❌ Invalid'}")
        if not validation_result['valid']:
            for error in validation_result['errors']:
                print(f"      - {error}")

# Test invalid document
print("\n🧪 Testing Invalid Document:")
invalid_doc = {
    "id": "invalid_001",
    "text": "This is an invalid document",
    "metadata": {
        "product_id": "INVALID",
        "price": "not_a_number",  # Should be number
        # Missing required fields: category, availability
    }
}

validation_result = schema_designer.validate_document(invalid_doc, "ecommerce")
print(f"   Invalid document: {'✅ Valid' if validation_result['valid'] else '❌ Invalid'}")
for error in validation_result['errors']:
    print(f"      - {error}")

In [None]:
# Demonstrate filtering strategies
print("\n🎯 ADVANCED FILTERING STRATEGIES")
print("=" * 50)

# Set up vector store with sample documents
filter_demo_store = AdvancedVectorStore(db_type="chroma")

# Add sample documents to vector store
all_samples = []
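for schema_name, documents in sample_documents.items():
    for doc in documents:
        # Chroma has historically accepted only scalar metadata values (str, int, float, bool),
        # so list fields such as tags or authors are joined into comma-separated strings
        metadata = {
            key: ", ".join(map(str, value)) if isinstance(value, list) else value
            for key, value in doc['metadata'].items()
        }
        metadata['schema_type'] = schema_name
        success = filter_demo_store.create_document(doc['id'], doc['text'], metadata)
        if success:
            all_samples.append(doc)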

print(f"✅ Added {len(all_samples)} documents to vector store")

# Demonstrate different filtering strategies
filtering_examples = [
    {
        "name": "Category Filtering",
        "query": "high quality products",
        "filters": {"category": "Electronics"},
        "description": "Find electronics matching the query"
    },
    {
        "name": "Schema Type Filtering",
        "query": "research and analysis",
        "filters": {"schema_type": "research"},
        "description": "Find research documents only"
    },
    {
        "name": "Brand Filtering",
        "query": "premium audio equipment",
        "filters": {"brand": "Sony"},
        "description": "Find Sony products matching audio query"
    },
    {
        "name": "Boolean Filtering",
        "query": "featured products",
        "filters": {"is_featured": True},
        "description": "Find only featured products"
    }
]

for example in filtering_examples:
    print(f"\n🔍 {example['name']}:")
    print(f"   Description: {example['description']}")
    print(f"   Query: '{example['query']}'")
    print(f"   Filters: {example['filters']}")
    
    results = filter_demo_store.search_with_filters(
        example['query'], 
        filters=example['filters'], 
        top_k=3
    )
    
    if results:
        for i, result in enumerate(results):
            print(f"      {i+1}. {result['text'][:50]}... (score: {result['score']:.3f})")
            print(f"          ID: {result['id']}, Schema: {result['metadata'].get('schema_type')}")
    else:
        print("      No results found")

print("\n💡 Filtering Best Practices:")
print("- Use primary filters for high-selectivity attributes (category, type, status)")
print("- Combine multiple filters to narrow down results effectively")
print("- Index frequently filtered fields for better performance")
print("- Consider hierarchical metadata for complex organizational structures")
print("- Use range filters for numerical and date fields when appropriate")

## 🎯 Key Takeaways

From this module, you should now understand:

### ❌ Traditional Database Limitations:
1. **No semantic search**: Traditional databases can't understand meaning or similarity
2. **Inefficient high-dimensional indexing**: B-tree indexes don't work well for vectors
3. **Linear scan performance**: Without a vector index, similarity search must compare the query against every record
4. **No approximate algorithms**: Can't trade accuracy for speed in similarity search
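
The linear-scan limitation can be made concrete. Below is a minimal sketch, assuming the embeddings are held in a NumPy array; `exact_top_k` is an illustrative helper rather than a library function, and the data is synthetic. Every query scores all N stored vectors, so the cost grows linearly with collection size.

```python
import numpy as np


def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> list:
    """Brute-force cosine search: scores every stored vector, O(N * d) per query."""
    # Normalize so that the inner product equals cosine similarity
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)

    scores = unit_vectors @ unit_query      # one dot product per stored vector
    top = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]


# Synthetic data for illustration only
rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 384)).astype("float32")
# A query that is a slightly perturbed copy of stored vector 123
query = vectors[123] + 0.01 * rng.normal(size=384).astype("float32")

results = exact_top_k(query, vectors, k=3)
print(results[0][0])  # index of the closest stored vector
```

ANN indexes such as HNSW avoid this full pass by visiting only a small fraction of the vectors, at the cost of occasionally missing a true neighbor.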

### ✅ Vector Database Advantages:
1. **Optimized for similarity search**: Purpose-built for finding similar vectors quickly
2. **Approximate Nearest Neighbor algorithms**: HNSW, IVF, LSH for fast retrieval
3. **Metadata filtering**: Combine semantic similarity with traditional filters
4. **Horizontal scaling**: Built for production workloads with millions of vectors
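
Metadata filtering (item 3 above) can be sketched as a pre-filter followed by similarity ranking. This illustration uses plain NumPy and assumes metadata is a list of dicts aligned with the vector rows; `filtered_search` is a hypothetical helper. Production stores typically evaluate the filter inside the index traversal instead of scanning, so treat this only as a model of the semantics.

```python
import numpy as np


def filtered_search(query, vectors, metadata, filters, k=3):
    """Rank only the vectors whose metadata matches every key/value pair in filters."""
    # Pre-filter: keep row indices whose metadata satisfies all equality conditions
    candidates = [
        i for i, meta in enumerate(metadata)
        if all(meta.get(key) == value for key, value in filters.items())
    ]
    if not candidates:
        return []

    subset = vectors[candidates]
    # Cosine similarity between the query and each surviving candidate
    scores = (subset @ query) / (np.linalg.norm(subset, axis=1) * np.linalg.norm(query))
    order = np.argsort(scores)[::-1][:k]
    return [(candidates[j], float(scores[j])) for j in order]


# Toy 2-D vectors and metadata, for illustration only
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
metadata = [
    {"category": "Electronics"},
    {"category": "Furniture"},
    {"category": "Electronics"},
    {"category": "Electronics"},
]

hits = filtered_search(np.array([1.0, 0.0]), vectors, metadata, {"category": "Electronics"}, k=2)
print(hits)
```

Row 1 is the second-closest vector overall, but it is excluded because its category does not match the filter.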

### 🏗️ Database Selection Criteria:
1. **Development vs Production**: Chroma for prototyping, Pinecone/Qdrant for production
2. **Cost considerations**: FAISS (free library) vs Pinecone (managed, premium) vs Qdrant (open source; free to self-host, paid managed cloud)
3. **Feature requirements**: Advanced filtering, hybrid search, multi-tenancy
4. **Performance needs**: Query latency, insertion speed, concurrent users

### 📊 CRUD Operations:
1. **Create**: Efficient batch insertion with metadata
2. **Read**: Retrieve documents by ID or similarity
3. **Update**: Modify text content and metadata
4. **Delete**: Remove documents and clean up indexes
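
The four operations can be sketched with a toy in-memory store. `TinyVectorStore` is a hypothetical class written for illustration; it is not the notebook's `AdvancedVectorStore` and does not mirror any particular database's API.

```python
import math
from typing import Any, Dict, List, Optional, Tuple


class TinyVectorStore:
    """Toy in-memory store illustrating create, read, update, and delete."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def create(self, doc_id: str, vector: List[float], metadata: Dict[str, Any]) -> None:
        if doc_id in self._records:
            raise KeyError(f"{doc_id} already exists")
        self._records[doc_id] = {"vector": list(vector), "metadata": dict(metadata)}

    def read(self, doc_id: str) -> Optional[Dict[str, Any]]:
        return self._records.get(doc_id)

    def update(self, doc_id: str, vector: Optional[List[float]] = None,
               metadata: Optional[Dict[str, Any]] = None) -> None:
        record = self._records[doc_id]  # raises KeyError if the document is missing
        if vector is not None:
            record["vector"] = list(vector)  # changed text requires a new embedding
        if metadata is not None:
            record["metadata"].update(metadata)

    def delete(self, doc_id: str) -> bool:
        return self._records.pop(doc_id, None) is not None

    def search(self, query: List[float], k: int = 3) -> List[Tuple[str, float]]:
        """Read by similarity: exact cosine ranking over all records."""
        def cosine(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        scored = [(doc_id, cosine(query, rec["vector"])) for doc_id, rec in self._records.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore()
store.create("doc_1", [1.0, 0.0], {"status": "draft"})
store.create("doc_2", [0.0, 1.0], {"status": "final"})
store.update("doc_1", metadata={"status": "final"})
store.delete("doc_2")
print(store.read("doc_1"), store.search([1.0, 0.1]))
```

In a real vector database, an update to the text also means re-embedding it, and a delete may only mark the entry as removed until the index is compacted.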

### 🎯 Metadata Design Patterns:
1. **Schema-based approach**: Define required fields and data types
2. **Hierarchical organization**: Use nested metadata for complex structures where the store supports it; otherwise flatten to prefixed keys (e.g. `venue_type`)
3. **Filtering strategy**: Primary filters for high-selectivity, secondary for refinement
4. **Performance optimization**: Index frequently filtered fields
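
The schemas in this module store dates as ISO strings, while range operators in several stores compare numbers. One common pattern is to keep a numeric timestamp next to each date string. The sketch below works under that assumption: `to_epoch`, `add_timestamp_fields`, and `date_range_filter` are illustrative helpers, and the `$and` / `$gte` / `$lte` operator style follows Chroma's `where` syntax, which should be checked against the version in use.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def to_epoch(date_str: str) -> int:
    """Convert an ISO date such as '2024-01-15' to a UTC Unix timestamp in seconds."""
    return int(datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc).timestamp())


def add_timestamp_fields(metadata: Dict[str, Any], date_fields: List[str]) -> Dict[str, Any]:
    """Return a copy of metadata with a numeric '<field>_ts' next to each date string."""
    enriched = dict(metadata)
    for field in date_fields:
        if field in metadata:
            enriched[f"{field}_ts"] = to_epoch(metadata[field])
    return enriched


def date_range_filter(field: str, start: str, end: str) -> Dict[str, Any]:
    """Build an inclusive range filter over the numeric timestamp field."""
    return {"$and": [
        {f"{field}_ts": {"$gte": to_epoch(start)}},
        {f"{field}_ts": {"$lte": to_epoch(end)}},
    ]}


meta = add_timestamp_fields({"date_created": "2024-01-15", "status": "Executed"}, ["date_created"])
print(meta["date_created_ts"])
print(date_range_filter("date_created", "2024-01-01", "2024-12-31"))
```

Keeping the original string alongside the timestamp preserves a human-readable value for display while the numeric field serves the range filter.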

## 🔄 Vector Database Workflow:
1. **Schema Design** → 2. **Document Ingestion** → 3. **Index Building** → 4. **Query Processing** → 5. **Results Filtering**

## 🎯 Next Steps

In the next modules, we'll explore:
- **Module 7**: Indexing algorithms and performance optimization
- **Module 8**: Comparing different search methods (exact, approximate, hybrid)
- **Module 9**: Advanced retrieval strategies and re-ranking

Understanding vector databases is crucial for building production-ready RAG systems that can scale to millions of documents!

## 🤔 Discussion Questions

1. When would you choose a local vector database like Chroma vs a cloud solution like Pinecone?
2. How would you design a metadata schema for a multi-tenant SaaS application?
3. What are the trade-offs between exact and approximate similarity search?
4. How would you handle updates to documents in a production vector database?
5. What factors should influence your choice of similarity metric?

## 📝 Optional Exercises

1. **Performance Comparison**: Set up Pinecone or Weaviate cloud instances and compare with local databases
2. **Custom Metadata Schema**: Design a metadata schema for your domain (e.g., medical records, financial documents)
3. **Batch Operations**: Implement efficient batch insertion and update operations
4. **Monitoring and Metrics**: Add performance monitoring to track query latency and throughput
5. **Hybrid Search**: Combine vector similarity with traditional keyword search