# Vector Databases and Similarity Search

## Overview
This notebook covers production-grade vector database implementations using:
- **Milvus**: Distributed vector database for billion-scale searches
- **Qdrant**: High-performance vector search engine
- **FAISS**: Meta's (formerly Facebook's) similarity search library

## Why Vector Databases?
- **Semantic Search**: Find similar items based on meaning, not keywords
- **Recommendation Systems**: Find products/content similar to user preferences
- **RAG Systems**: Retrieve relevant context for LLM prompts
- **Anomaly Detection**: Identify outliers in high-dimensional spaces

## Interview Focus
- Indexing strategies (IVF, HNSW, LSH)
- Distance metrics (L2, cosine, dot product)
- Scalability and sharding
- Production deployment patterns
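
For the distance-metric item above, here is a minimal pure-Python sketch of how L2, cosine, and dot product relate. The helper names (`dot`, `norm`, `l2_distance`, `cosine`, `normalize`) are illustrative and not taken from any library. It shows that once vectors are unit-normalized, the inner product equals cosine similarity and squared L2 distance equals 2 − 2·cosine, so all three rank neighbors the same way.

```python
import math
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))

def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a: Vector, b: Vector) -> float:
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a: Vector) -> Vector:
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_hat, b_hat = normalize(a), normalize(b)

# On unit vectors, inner product and cosine coincide.
ip_normalized = dot(a_hat, b_hat)
cos_raw = cosine(a, b)

# Squared L2 distance on unit vectors is a monotone function of cosine.
l2_sq_normalized = l2_distance(a_hat, b_hat) ** 2

# L2 is sensitive to magnitude; cosine is not.
a_scaled = [2 * x for x in a]
print(f"cosine(a, 2a) = {cosine(a, a_scaled):.3f}, L2(a, 2a) = {l2_distance(a, a_scaled):.3f}")
print(f"IP on unit vectors = {ip_normalized:.4f}, cosine = {cos_raw:.4f}")
print(f"L2^2 on unit vectors = {l2_sq_normalized:.4f}, 2 - 2*cos = {2 - 2 * cos_raw:.4f}")
```

This is why the notebook can normalize embeddings once and then use an inner-product index throughout.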

In [None]:
# Installation
# pip install pymilvus qdrant-client faiss-cpu sentence-transformers

import torch
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from typing import List, Tuple
import time

## Part 1: Embeddings with MiniLM

First, we generate embeddings with `all-MiniLM-L6-v2`, a compact sentence-transformers model that produces 384-dimensional vectors.

In [None]:
class EmbeddingService:
    """Production embedding service with caching and batch processing."""
    
    def __init__(self, model_name: str = 'sentence-transformers/all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()
        self.cache = {}
        
    def embed(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """Generate embeddings with batching for efficiency."""
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=True,
            normalize_embeddings=True  # For cosine similarity
        )
        return embeddings
    
    def embed_query(self, query: str) -> np.ndarray:
        """Fast single query embedding with caching."""
        if query in self.cache:
            return self.cache[query]
        
        embedding = self.model.encode([query], normalize_embeddings=True)[0]
        self.cache[query] = embedding
        return embedding

# Initialize service
embedding_service = EmbeddingService()
print(f"Embedding dimension: {embedding_service.dimension}")

In [None]:
# Sample dataset: Product descriptions
documents = [
    "High-performance laptop with 32GB RAM and RTX 4090 GPU",
    "Wireless noise-cancelling headphones with 30-hour battery",
    "Ergonomic office chair with lumbar support and adjustable height",
    "4K gaming monitor with 144Hz refresh rate and HDR",
    "Mechanical keyboard with RGB backlighting and Cherry MX switches",
    "Portable SSD with 2TB storage and USB-C connectivity",
    "Standing desk converter with electric height adjustment",
    "Webcam with 1080p resolution and auto-focus capability",
    "Docking station with dual 4K display support",
    "Smartphone with 256GB storage and 5G connectivity"
] * 100  # Replicate to 1,000 rows; because of the duplicates, top-k hits will repeat the same text

print(f"Total documents: {len(documents)}")

# Generate embeddings
embeddings = embedding_service.embed(documents)
print(f"Embeddings shape: {embeddings.shape}")

## Part 2: FAISS - Fast Similarity Search

FAISS is Meta's library optimized for billion-scale similarity search.
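
Before the FAISS wrapper, it may help to see what an inverted-file (IVF) index does conceptually. The following is a toy pure-Python sketch, not FAISS's implementation: `ToyIVF`, `train_centroids`, and `brute_force` are hypothetical names, it uses squared L2 distance for simplicity, and the k-means training is a handful of Lloyd iterations. Vectors are assigned to their nearest centroid, and a query scans only the `nprobe` closest clusters.

```python
import random
from typing import List

Vector = List[float]

def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(v: Vector, centroids: List[Vector]) -> int:
    return min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))

def train_centroids(vectors: List[Vector], nlist: int, iters: int = 5, seed: int = 0) -> List[Vector]:
    """A few Lloyd (k-means) iterations; enough for a demonstration."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        buckets: List[List[Vector]] = [[] for _ in range(nlist)]
        for v in vectors:
            buckets[nearest_centroid(v, centroids)].append(v)
        for j, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

class ToyIVF:
    def __init__(self, vectors: List[Vector], nlist: int):
        self.vectors = vectors
        self.centroids = train_centroids(vectors, nlist)
        # Inverted lists: cluster id -> ids of the vectors assigned to it.
        self.lists: List[List[int]] = [[] for _ in range(nlist)]
        for i, v in enumerate(vectors):
            self.lists[nearest_centroid(v, self.centroids)].append(i)

    def search(self, query: Vector, k: int, nprobe: int) -> List[int]:
        # Visit only the nprobe clusters whose centroids are closest to the query.
        order = sorted(range(len(self.centroids)), key=lambda j: sq_dist(query, self.centroids[j]))
        candidates = [i for j in order[:nprobe] for i in self.lists[j]]
        candidates.sort(key=lambda i: (sq_dist(query, self.vectors[i]), i))
        return candidates[:k]

def brute_force(vectors: List[Vector], query: Vector, k: int) -> List[int]:
    order = sorted(range(len(vectors)), key=lambda i: (sq_dist(query, vectors[i]), i))
    return order[:k]

rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
query = [rng.gauss(0, 1) for _ in range(8)]

ivf = ToyIVF(data, nlist=8)
exact = brute_force(data, query, k=5)
for nprobe in (1, 4, 8):
    approx = ivf.search(query, k=5, nprobe=nprobe)
    recall = len(set(approx) & set(exact)) / len(exact)
    print(f"nprobe={nprobe}: recall@5 = {recall:.2f}")
```

Probing every cluster (`nprobe == nlist`) reproduces exact search. Smaller `nprobe` values scan fewer vectors and may miss true neighbors.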

In [None]:
class FAISSVectorStore:
    """Production FAISS implementation with multiple index types."""
    
    def __init__(self, dimension: int, index_type: str = 'flat'):
        self.dimension = dimension
        self.index_type = index_type
        self.index = self._create_index()
        self.documents = []
        
    def _create_index(self):
        """Create appropriate FAISS index based on scale and requirements."""
        if self.index_type == 'flat':
            # Exact search, best for <1M vectors
            return faiss.IndexFlatIP(self.dimension)  # Inner Product (cosine if normalized)
        
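        elif self.index_type == 'ivf':
            # Inverted File Index, good for 1M-10M vectors
            quantizer = faiss.IndexFlatIP(self.dimension)
            nlist = 100  # Number of clusters
            # Pass the metric explicitly: IndexIVFFlat defaults to L2,
            # which would not match the inner-product quantizer.
            return faiss.IndexIVFFlat(
                quantizer, self.dimension, nlist, faiss.METRIC_INNER_PRODUCT
            )
        
        elif self.index_type == 'hnsw':
            # Hierarchical Navigable Small World, best for <10M vectors
            M = 32  # Number of connections per node
            # The default metric is L2; use inner product so scores are
            # comparable with the flat and IVF indexes.
            return faiss.IndexHNSWFlat(self.dimension, M, faiss.METRIC_INNER_PRODUCT)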
        
        else:
            raise ValueError(f"Unknown index type: {self.index_type}")
    
    def add(self, embeddings: np.ndarray, documents: List[str]):
        """Add vectors to the index."""
        embeddings = embeddings.astype('float32')
        
        if self.index_type == 'ivf' and not self.index.is_trained:
            print("Training IVF index...")
            self.index.train(embeddings)
        
        self.index.add(embeddings)
        self.documents.extend(documents)
        print(f"Index now contains {self.index.ntotal} vectors")
    
    def search(self, query_embedding: np.ndarray, k: int = 5) -> List[Tuple[str, float]]:
        """Search for top-k similar documents."""
        query_embedding = query_embedding.reshape(1, -1).astype('float32')
        
        if self.index_type == 'ivf':
            self.index.nprobe = 10  # Number of clusters to search
        
        scores, indices = self.index.search(query_embedding, k)
        
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1:  # Valid result
                results.append((self.documents[idx], float(score)))
        
        return results
    
    def benchmark(self, query_embedding: np.ndarray, k: int = 5, n_queries: int = 100):
        """Benchmark search performance."""
        start = time.perf_counter()  # Monotonic clock, better suited to timing than time.time()
        for _ in range(n_queries):
            self.search(query_embedding, k)
        elapsed = time.perf_counter() - start
        
        qps = n_queries / elapsed
        latency_ms = (elapsed / n_queries) * 1000
        
        return {
            'qps': qps,
            'latency_ms': latency_ms,
            'total_time': elapsed
        }

# Test FAISS with different index types
print("\n=== FAISS Flat Index ===")
faiss_flat = FAISSVectorStore(embedding_service.dimension, 'flat')
faiss_flat.add(embeddings, documents)

query = "laptop with powerful GPU for machine learning"
query_emb = embedding_service.embed_query(query)

results = faiss_flat.search(query_emb, k=3)
print("\nTop 3 results:")
for doc, score in results:
    print(f"  Score: {score:.4f} | {doc}")

# Benchmark
metrics = faiss_flat.benchmark(query_emb)
print(f"\nPerformance: {metrics['qps']:.2f} QPS, {metrics['latency_ms']:.2f}ms latency")

## Part 3: Milvus - Distributed Vector Database

Milvus is designed for production deployments with billion-scale vectors.

In [None]:
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType, utility

class MilvusVectorStore:
    """Minimal Milvus wrapper: collection setup, insert, search, and stats."""
    
    def __init__(self, collection_name: str, dimension: int, host: str = 'localhost', port: int = 19530):
        self.collection_name = collection_name
        self.dimension = dimension
        
        # Connect to Milvus
        connections.connect(alias="default", host=host, port=port)
        
        # Create collection if not exists
        self._create_collection()
    
    def _create_collection(self):
        """Create collection with optimized schema."""
        if utility.has_collection(self.collection_name):
            self.collection = Collection(self.collection_name)
            print(f"Loaded existing collection: {self.collection_name}")
            return
        
        # Define schema
        fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
            FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=self.dimension),
            FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=1000)
        ]
        schema = CollectionSchema(fields, description="Product search collection")
        
        self.collection = Collection(self.collection_name, schema)
        print(f"Created collection: {self.collection_name}")
        
        # Create index for fast search
        index_params = {
            "metric_type": "IP",  # Inner Product (cosine for normalized vectors)
            "index_type": "IVF_FLAT",
            "params": {"nlist": 128}
        }
        self.collection.create_index(field_name="embedding", index_params=index_params)
        print("Created index")
    
    def add(self, embeddings: np.ndarray, documents: List[str]):
        """Batch insert; columns follow schema order (the auto-generated id is omitted)."""
        data = [
            embeddings.tolist(),
            documents
        ]
        
        self.collection.insert(data)
        self.collection.flush()
        print(f"Inserted {len(documents)} documents")
    
    def search(self, query_embedding: np.ndarray, k: int = 5, filter_expr: str = None) -> List[Tuple[str, float]]:
        """Vector search with an optional scalar filter expression."""
        self.collection.load()
        
        search_params = {
            "metric_type": "IP",
            "params": {"nprobe": 16}
        }
        
        results = self.collection.search(
            data=[query_embedding.tolist()],
            anns_field="embedding",
            param=search_params,
            limit=k,
            expr=filter_expr,
            output_fields=["text"]
        )
        
        output = []
        for hit in results[0]:
            output.append((hit.entity.get('text'), hit.score))
        
        return output
    
    def get_stats(self):
        """Get collection statistics."""
        self.collection.flush()
        return {
            'num_entities': self.collection.num_entities,
            'index_type': 'IVF_FLAT',
            'metric_type': 'IP'
        }
    
    def cleanup(self):
        """Drop collection and disconnect."""
        utility.drop_collection(self.collection_name)
        connections.disconnect("default")

# Example usage (requires Milvus server)
# Uncomment to run with Milvus:
# milvus_store = MilvusVectorStore('products', embedding_service.dimension)
# milvus_store.add(embeddings[:100], documents[:100])
# results = milvus_store.search(query_emb, k=5)
# print(milvus_store.get_stats())

print("\nMilvus example ready (requires running Milvus server)")
print("Start a Milvus standalone server first (see the Milvus installation docs); it listens on port 19530 by default")

## Part 4: Qdrant - High-Performance Vector Search

Qdrant provides filtering, payloads, and easy deployment.

In [None]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

class QdrantVectorStore:
    """Production Qdrant implementation with filtering and metadata."""
    
    def __init__(self, collection_name: str, dimension: int, in_memory: bool = True):
        self.collection_name = collection_name
        self.dimension = dimension
        
        # Initialize client (in-memory or server)
        if in_memory:
            self.client = QdrantClient(":memory:")
        else:
            self.client = QdrantClient(host="localhost", port=6333)
        
        # Create collection
        self._create_collection()
    
    def _create_collection(self):
        """Create collection with HNSW index."""
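        # recreate_collection is deprecated in recent qdrant-client releases,
        # so drop and create the collection explicitly.
        if self.client.collection_exists(self.collection_name):
            self.client.delete_collection(self.collection_name)
        self.client.create_collection(
            collection_name=self.collection_name,
            vectors_config=VectorParams(
                size=self.dimension,
                distance=Distance.COSINE
            )
        )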
        print(f"Created Qdrant collection: {self.collection_name}")
    
    def add(self, embeddings: np.ndarray, documents: List[str], metadata: List[dict] = None):
        """Add vectors with rich metadata for filtering."""
        points = []
        for idx, (emb, doc) in enumerate(zip(embeddings, documents)):
            payload = {"text": doc}
            if metadata and idx < len(metadata):
                payload.update(metadata[idx])
            
            points.append(
                PointStruct(
                    id=idx,
                    vector=emb.tolist(),
                    payload=payload
                )
            )
        
        self.client.upsert(
            collection_name=self.collection_name,
            points=points
        )
        print(f"Added {len(points)} points to Qdrant")
    
    def search(self, query_embedding: np.ndarray, k: int = 5, filter_conditions: dict = None) -> List[Tuple[str, float]]:
        """Semantic search with optional metadata filtering."""
        query_filter = None
        if filter_conditions:
            # Example: {"category": "electronics"}
            conditions = []
            for key, value in filter_conditions.items():
                conditions.append(
                    FieldCondition(key=key, match=MatchValue(value=value))
                )
            query_filter = Filter(must=conditions)
        
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding.tolist(),
            limit=k,
            query_filter=query_filter
        )
        
        return [(hit.payload['text'], hit.score) for hit in results]
    
    def get_stats(self):
        """Get collection info."""
        info = self.client.get_collection(self.collection_name)
        # vectors_count is not present on every qdrant-client version,
        # so read it defensively.
        return {
            'vectors_count': getattr(info, 'vectors_count', None),
            'indexed_vectors_count': info.indexed_vectors_count,
            'points_count': info.points_count
        }

# Test Qdrant
print("\n=== Qdrant Vector Store ===")
qdrant_store = QdrantVectorStore('products', embedding_service.dimension)

# Add with metadata
metadata = [{"category": "electronics", "price": 1200 + i} for i in range(len(documents[:100]))]
qdrant_store.add(embeddings[:100], documents[:100], metadata)

# Search
results = qdrant_store.search(query_emb, k=3)
print("\nTop 3 results:")
for doc, score in results:
    print(f"  Score: {score:.4f} | {doc}")

# Filtered search
print("\nFiltered search (category=electronics):")
filtered_results = qdrant_store.search(query_emb, k=3, filter_conditions={"category": "electronics"})
for doc, score in filtered_results:
    print(f"  Score: {score:.4f} | {doc}")

print(f"\nCollection stats: {qdrant_store.get_stats()}")

## Part 5: Comparison and Best Practices

### When to Use Each:

**FAISS:**
- ✅ Best for: <10M vectors, research, prototyping
- ✅ Fastest for exact search
- ❌ No built-in metadata storage or filtering; persistence is limited to serializing the index yourself (e.g. `faiss.write_index`)
- ❌ Requires custom infrastructure

**Milvus:**
- ✅ Best for: 10M-1B+ vectors, enterprise scale
- ✅ Distributed, auto-scaling, high availability
- ✅ Hybrid search (vector + scalar filtering)
- ❌ More complex to deploy and maintain

**Qdrant:**
- ✅ Best for: 1M-100M vectors, developer-friendly
- ✅ Rich filtering, easy deployment, great DX
- ✅ Good balance of features and simplicity
- ❌ Less proven at billion-scale

### Production Checklist:
1. **Indexing Strategy**: Choose based on scale and accuracy needs
2. **Monitoring**: Track QPS, latency (P50/P95/P99), recall@k
3. **Sharding**: Partition data for horizontal scaling
4. **Caching**: Cache embeddings and frequent queries
5. **Backups**: Regular snapshots and point-in-time recovery
6. **Security**: Authentication, encryption at rest/transit
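
For the caching item above, one possible approach is a bounded least-recently-used cache in front of the embedding model, so memory does not grow without limit. This is a sketch under assumptions: `LRUEmbeddingCache` is a hypothetical helper, and `fake_embed` is a stand-in for a real call such as `embedding_service.embed_query`.

```python
from collections import OrderedDict
from typing import Callable, List

Vector = List[float]

class LRUEmbeddingCache:
    """Bounded query-embedding cache that evicts the least recently used entry."""

    def __init__(self, embed_fn: Callable[[str], Vector], max_size: int = 1024):
        self._embed_fn = embed_fn
        self._store: "OrderedDict[str, Vector]" = OrderedDict()
        self.max_size = max_size
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> Vector:
        if text in self._store:
            self._store.move_to_end(text)  # mark as most recently used
            self.hits += 1
            return self._store[text]
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[text] = vector
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # drop the least recently used entry
        return vector

    def __contains__(self, text: str) -> bool:
        return text in self._store

def fake_embed(text: str) -> Vector:
    # Stand-in for a real embedding model (assumption for this example).
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

cache = LRUEmbeddingCache(fake_embed, max_size=2)
for q in ["laptop", "headphones", "laptop", "monitor"]:
    cache.get(q)

print(f"hits={cache.hits}, misses={cache.misses}")
print("headphones cached:", "headphones" in cache)
```

Tracking hits and misses also gives a cache hit rate to report alongside QPS and latency.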

In [None]:
# Performance comparison
import pandas as pd

def compare_vector_stores():
    """Benchmark FAISS (Flat, HNSW) and Qdrant; Milvus is skipped because it needs a running server."""
    results = []
    
    # FAISS Flat
    faiss_metrics = faiss_flat.benchmark(query_emb, k=5, n_queries=100)
    results.append({
        'Store': 'FAISS (Flat)',
        'QPS': f"{faiss_metrics['qps']:.1f}",
        'Latency (ms)': f"{faiss_metrics['latency_ms']:.2f}",
        'Index Type': 'Exact',
        'Best For': '<1M vectors'
    })
    
    # FAISS HNSW
    faiss_hnsw = FAISSVectorStore(embedding_service.dimension, 'hnsw')
    faiss_hnsw.add(embeddings, documents)
    hnsw_metrics = faiss_hnsw.benchmark(query_emb, k=5, n_queries=100)
    results.append({
        'Store': 'FAISS (HNSW)',
        'QPS': f"{hnsw_metrics['qps']:.1f}",
        'Latency (ms)': f"{hnsw_metrics['latency_ms']:.2f}",
        'Index Type': 'Approximate',
        'Best For': '1M-10M vectors'
    })
    
    # Qdrant (the ":memory:" client runs in local mode, so timings are not representative of a Qdrant server)
    start = time.time()
    for _ in range(100):
        qdrant_store.search(query_emb, k=5)
    qdrant_time = time.time() - start
    results.append({
        'Store': 'Qdrant',
        'QPS': f"{100/qdrant_time:.1f}",
        'Latency (ms)': f"{(qdrant_time/100)*1000:.2f}",
        'Index Type': 'HNSW',
        'Best For': '1M-100M vectors'
    })
    
    df = pd.DataFrame(results)
    return df

print("\n=== Performance Comparison ===")
comparison = compare_vector_stores()
print(comparison.to_string(index=False))

## Part 6: Advanced Techniques

### Hybrid Search
Combine vector similarity with traditional filters for better results.
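
Where the filter is applied matters. The sketch below is a toy illustration with hypothetical helpers (`post_filter_search`, `pre_filter_search`) and made-up data, not any engine's actual algorithm. It compares filtering after the vector search with filtering before it. Post-filtering can return fewer than `k` results when the nearest vectors fail the filter.

```python
from typing import Dict, List, Tuple

Item = Tuple[List[float], Dict[str, str]]  # (vector, metadata)

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def post_filter_search(items: List[Item], query: List[float], k: int, category: str) -> List[int]:
    """Take the top-k by similarity first, then drop hits that fail the filter."""
    ranked = sorted(range(len(items)), key=lambda i: -dot(items[i][0], query))
    return [i for i in ranked[:k] if items[i][1]["category"] == category]

def pre_filter_search(items: List[Item], query: List[float], k: int, category: str) -> List[int]:
    """Restrict to matching items first, then rank only those by similarity."""
    allowed = [i for i in range(len(items)) if items[i][1]["category"] == category]
    allowed.sort(key=lambda i: -dot(items[i][0], query))
    return allowed[:k]

items: List[Item] = [
    ([1.0, 0.0], {"category": "furniture"}),
    ([0.9, 0.1], {"category": "furniture"}),
    ([0.8, 0.2], {"category": "electronics"}),
    ([0.1, 0.9], {"category": "electronics"}),
]
query = [1.0, 0.0]

post = post_filter_search(items, query, k=2, category="electronics")
pre = pre_filter_search(items, query, k=2, category="electronics")
print("post-filter hits:", post)
print("pre-filter hits: ", pre)
```

Here the two closest vectors are both furniture, so post-filtering returns nothing while pre-filtering still finds two electronics items. Over-fetching (as `k*2` does in the engine below) reduces this problem but does not remove it.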

In [None]:
class HybridSearchEngine:
    """Production hybrid search combining vectors + metadata."""
    
    def __init__(self, vector_store: QdrantVectorStore):
        self.vector_store = vector_store
    
    def search(
        self,
        query: str,
        embedding_service: EmbeddingService,
        k: int = 10,
        filters: dict = None,
        rerank: bool = True
    ) -> List[Tuple[str, float]]:
        """Multi-stage search with optional reranking."""
        
        # Stage 1: Vector search with filters
        query_emb = embedding_service.embed_query(query)
        results = self.vector_store.search(
            query_emb,
            k=k*2 if rerank else k,  # Fetch more for reranking
            filter_conditions=filters
        )
        
        if not rerank:
            return results[:k]
        
        # Stage 2: Rerank with cross-encoder (more accurate but slower)
        # In production, use a cross-encoder model here
        # For now, we'll just return top-k
        return results[:k]
    
    def explain_ranking(self, query: str, doc: str, embedding_service: EmbeddingService) -> dict:
        """Explain why a document was ranked for a query."""
        query_emb = embedding_service.embed_query(query)
        doc_emb = embedding_service.embed_query(doc)
        
        # Cosine similarity
        similarity = np.dot(query_emb, doc_emb) / (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb))
        
        return {
            'query': query,
            'document': doc,
            'similarity': float(similarity),
            'explanation': f"Cosine similarity between query and document embeddings: {similarity:.3f}"
        }

# Example: Hybrid search
hybrid_engine = HybridSearchEngine(qdrant_store)
results = hybrid_engine.search(
    "affordable laptop for students",
    embedding_service,
    k=5,
    filters={"category": "electronics"}
)

print("\n=== Hybrid Search Results ===")
for doc, score in results:
    print(f"Score: {score:.4f} | {doc}")

# Explain ranking
explanation = hybrid_engine.explain_ranking(
    "gaming laptop",
    "High-performance laptop with 32GB RAM and RTX 4090 GPU",
    embedding_service
)
print(f"\nRanking explanation: {explanation}")

## Key Takeaways

1. **Choose the right tool**: FAISS for prototypes, Qdrant for small-to-mid-scale services, Milvus for enterprise scale
2. **Index selection matters**: Flat for accuracy, HNSW for speed, IVF for scale
3. **Monitor performance**: Track QPS, latency, and recall@k
4. **Use hybrid search**: Combine vectors with metadata filtering
5. **Plan for scale**: Shard data, cache embeddings, use batch operations
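
The sharding advice in the last takeaway can be sketched as a scatter-gather search: each shard returns its local top-k, and a coordinator merges the partial results. This is a simplified single-process illustration with hypothetical function names (`search_shard`, `scatter_gather`). A real deployment would issue the shard queries in parallel over the network.

```python
import heapq
import random
from typing import List, Tuple

Vector = List[float]
Shard = List[Tuple[int, Vector]]  # (doc_id, vector)

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def search_shard(shard: Shard, query: Vector, k: int) -> List[Tuple[float, int]]:
    """Local top-k on one shard, as (score, doc_id) pairs, best first."""
    return heapq.nlargest(k, ((dot(vec, query), doc_id) for doc_id, vec in shard))

def scatter_gather(shards: List[Shard], query: Vector, k: int) -> List[Tuple[float, int]]:
    """Query every shard for its top-k, then merge into a global top-k."""
    partials = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return heapq.nlargest(k, partials)

rng = random.Random(7)
corpus: Shard = [(i, [rng.uniform(-1, 1) for _ in range(4)]) for i in range(300)]
shards = [corpus[i::3] for i in range(3)]  # round-robin split into 3 shards
query = [rng.uniform(-1, 1) for _ in range(4)]

global_top = search_shard(corpus, query, k=5)
sharded_top = scatter_gather(shards, query, k=5)
print("same result as single-node search:", sharded_top == global_top)
```

With exact per-shard search, each global top-k hit is also in its own shard's top-k, so the merged result matches a single-node search. With approximate per-shard indexes the merge works the same way, but recall depends on each shard's index.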

## Interview Questions

1. **Explain the trade-off between HNSW and IVF indexes.**
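   - HNSW: Graph-based, roughly O(log n) search in practice, high recall, more memory (graph links are stored alongside the vectors)
   - IVF: Cluster-based, faster index build, less memory; recall depends on `nprobe` (more probed clusters → higher recall, slower search)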

2. **How do you handle billion-scale vector search?**
   - Shard across nodes, use approximate indexes (IVF/HNSW)
   - Quantization (PQ/SQ) to reduce memory
   - Two-stage search: coarse filter → fine ranking

3. **What's the difference between L2 distance and cosine similarity?**
   - L2: Euclidean distance, sensitive to magnitude
   - Cosine: Angle-based, ignores magnitude, a common choice for text embeddings
   - For unit-normalized vectors, IP (inner product) equals cosine, and L2² = 2 − 2·cosine, so all three give the same ranking

4. **How do you monitor vector database performance?**
   - QPS (queries per second)
   - Latency percentiles (P50, P95, P99)
   - Recall@k (accuracy vs ground truth)
   - Index build time and memory usage
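
The monitoring metrics in the last answer are straightforward to compute from logged data. The sketch below uses made-up latency samples and result lists. `percentile` (nearest-rank method) and `recall_at_k` are illustrative helpers, and a production system would more likely rely on its metrics stack for percentiles.

```python
import math
from typing import List, Sequence

def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def recall_at_k(retrieved: List[int], ground_truth: List[int], k: int) -> float:
    """Fraction of the exact top-k (ground truth) that the index also returned in its top-k."""
    truth = set(ground_truth[:k])
    return len(set(retrieved[:k]) & truth) / len(truth)

# Hypothetical per-query latencies in milliseconds.
latencies_ms = [1.2, 0.9, 1.1, 5.0, 1.0, 1.3, 0.8, 1.4, 1.0, 9.5]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Hypothetical doc ids: approximate index output vs. exact (brute-force) search.
approx_ids = [3, 7, 1, 9, 4]
exact_ids = [3, 1, 4, 8, 2]
recall = recall_at_k(approx_ids, exact_ids, k=5)

print(f"P50={p50}ms P95={p95}ms P99={p99}ms recall@5={recall:.2f}")
```

In this sample the median is low while P95 and P99 are dominated by a single slow query, which is why tail percentiles are tracked separately from the average.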