# Module 5: Working with Embeddings

## 🎯 Learning Objectives
By the end of this module, you will:
- Generate and manipulate text embeddings programmatically
- Understand and implement different similarity metrics
- Visualize high-dimensional embedding spaces in 2D/3D
- Build efficient semantic search functionality
- Explore embedding neighborhoods and relationships
- Implement batch processing for large datasets
- Use 2025's latest visualization tools

## 📚 Key Concepts

### What Are Embeddings Really? 🔢

**Think of embeddings as coordinates in meaning-space:**
- Each dimension captures some aspect of meaning
- Similar texts have similar coordinates
- Smaller distance between points = higher semantic similarity
- We can navigate this space mathematically
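
As a toy illustration of "navigating mathematically", the sketch below uses hand-made 3-dimensional vectors (invented for this example, not produced by any model). It averages two vectors to get a centroid and ranks every point by its distance to that centroid.

```python
import numpy as np

# Hand-made 3-D "embeddings" (invented; real models use hundreds of dimensions)
points = {
    "cat":    np.array([0.9, 0.1, 0.0]),
    "kitten": np.array([0.8, 0.2, 0.1]),
    "stock":  np.array([0.0, 0.9, 0.3]),
    "bond":   np.array([0.1, 0.8, 0.4]),
}

# The centroid of the two animal vectors is itself a point in the space
animal_centroid = (points["cat"] + points["kitten"]) / 2

# Rank every word by Euclidean distance to the centroid (smaller = closer in meaning)
ranked = sorted(points, key=lambda w: np.linalg.norm(points[w] - animal_centroid))
print(ranked)
```

The two animal words land at the front of the ranking and the two finance words at the back, which is the behavior we hope real embeddings show at much higher dimensionality.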

### Similarity Metrics Comparison 📏
| Metric | Formula | When to Use | Range |
|--------|---------|-------------|-------|
| **Cosine** | cos(θ) = A·B / (\|A\|\|B\|) | Text, normalized vectors | [-1, 1] |
| **Dot Product** | A·B | Unit vectors, fast computation | (-∞, ∞) |
| **Euclidean** | √(Σ(Ai-Bi)²) | When magnitude matters | [0, ∞) |
| **Manhattan** | Σ\|Ai-Bi\| | Less sensitive to outliers than Euclidean | [0, ∞) |
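
The formulas in the table can be checked by hand on two small vectors. This is a minimal NumPy sketch; the helper names are made up for this example and are not part of any library.

```python
import numpy as np

def cosine_similarity_pair(a, b):
    """cos(θ) = A·B / (|A||B|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Square root of the summed squared differences."""
    return float(np.linalg.norm(a - b))

def manhattan_distance(a, b):
    """Sum of the absolute differences."""
    return float(np.sum(np.abs(a - b)))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

print(cosine_similarity_pair(a, b))  # 24 / (5 * 5) = 0.96
print(float(np.dot(a, b)))           # 3*4 + 4*3 = 24.0
print(euclidean_distance(a, b))      # sqrt(1 + 1), about 1.414
print(manhattan_distance(a, b))      # 1 + 1 = 2.0

# After scaling both vectors to unit length, the dot product equals the cosine
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(float(np.dot(a_unit, b_unit)))
```

Note that the raw dot product (24.0) depends on vector length, while the cosine (0.96) does not. That is why the two only agree once vectors are normalized.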

### 2025 Visualization Advances 🎨
- **UMAP dominance**: Typically faster than t-SNE; preserves local structure well and retains more global structure
- **Interactive exploration**: Nomic Atlas for massive datasets
- **3D visualization**: Real-time navigation of embedding spaces
- **Contextual embeddings**: Dynamic embeddings based on context

### Performance Optimization 🚀
- **Batch processing**: Process multiple texts simultaneously
- **Vector operations**: Use NumPy/PyTorch for speed
- **Memory management**: Handle large embedding datasets efficiently
- **Caching strategies**: Store and reuse embeddings
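
One way the caching idea could look in practice is sketched below. `EmbeddingCache` and `fake_embed` are hypothetical names introduced for this example; the stand-in encoder exists only so the code runs without downloading a model.

```python
import hashlib
import numpy as np

class EmbeddingCache:
    """In-memory cache keyed by a hash of the text (hypothetical helper)."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # callable: list[str] -> array of shape (n, dim)
        self._store = {}
        self.misses = 0            # number of texts that actually had to be embedded

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def encode(self, texts):
        # Embed only texts not seen before, in a single batched call
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self._store]
        if missing:
            self.misses += len(missing)
            for text, vector in zip(missing, self.embed_fn(missing)):
                self._store[self._key(text)] = np.asarray(vector)
        return np.stack([self._store[self._key(t)] for t in texts])


def fake_embed(texts):
    """Stand-in encoder (assumption): two hand-made features per text."""
    return np.array([[len(t), t.count(" ")] for t in texts], dtype=np.float32)

cache = EmbeddingCache(fake_embed)
cache.encode(["hello world", "embeddings are vectors"])
cache.encode(["hello world", "a new sentence"])   # only "a new sentence" is embedded
print(cache.misses)  # 3
```

With a real model, passing `model.encode` as `embed_fn` should work the same way, since it also maps a list of strings to a 2-D array. A persistent cache would additionally write the store to disk.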


## 🛠️ Setup
Let's install the required packages for hands-on embedding work.

In [None]:
# Install required packages
!pip install -q sentence-transformers numpy matplotlib plotly
!pip install -q scikit-learn umap-learn seaborn
!pip install -q torch torchvision  # For efficient operations
!pip install -q pandas ipywidgets  # For interactive widgets
# Optional: !pip install -q nomic  # For advanced visualization

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import time
import warnings
warnings.filterwarnings('ignore')

# Embedding and ML libraries
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import umap  # Provided by the umap-learn package; exposes umap.UMAP

# Torch for efficient operations
import torch

print("✅ Setup complete!")
print("🔢 Ready to work with embeddings!")
print(f"🚀 Using device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

## 🔢 Exercise 1: Generating and Inspecting Embeddings

Let's start by generating embeddings and understanding their structure.

In [None]:
# Load a good general-purpose embedding model
print("🤖 Loading embedding model...")
model = SentenceTransformer('all-mpnet-base-v2')  # 768 dimensions, good performance
print(f"   Model: {model._modules['0'].auto_model.name_or_path}")
print(f"   Dimensions: {model.get_sentence_embedding_dimension()}")
print(f"   Max sequence length: {model.max_seq_length}")

# Sample texts covering different topics and styles
sample_texts = [
    "Machine learning algorithms require large datasets for training",
    "Deep learning models need substantial data to learn effectively",  # Similar to above
    "The cat sat peacefully on the warm windowsill",
    "A feline rested comfortably near the sunny window",  # Similar to above
    "Financial markets experienced significant volatility yesterday",
    "Stock prices fluctuated wildly during the trading session",  # Similar to above
    "Python is a versatile programming language",
    "The snake slithered silently through the tall grass",  # Same word, different meaning
    "Climate change affects global weather patterns",
    "Artificial neural networks mimic biological brain functions"
]

print(f"\n📝 Sample texts ({len(sample_texts)} items):")
for i, text in enumerate(sample_texts, 1):
    print(f"   {i:2d}. {text}")

In [None]:
# Generate embeddings
print("\n🔄 Generating embeddings...")
start_time = time.time()
embeddings = model.encode(sample_texts, convert_to_numpy=True)
end_time = time.time()

print(f"   ✅ Generated {embeddings.shape[0]} embeddings in {end_time - start_time:.3f} seconds")
print(f"   📐 Shape: {embeddings.shape}")
print(f"   💾 Memory usage: {embeddings.nbytes / 1024:.1f} KB")
print(f"   📊 Data type: {embeddings.dtype}")

# Inspect embedding properties
print(f"\n🔍 Embedding Analysis:")
print(f"   Value range: [{embeddings.min():.3f}, {embeddings.max():.3f}]")
print(f"   Mean: {embeddings.mean():.3f}")
print(f"   Standard deviation: {embeddings.std():.3f}")
print(f"   Sparsity: {(embeddings == 0).mean()*100:.1f}% zeros")

# Show first embedding sample
print(f"\n📋 First embedding (first 10 dimensions):")
print(f"   Text: '{sample_texts[0]}'")
print(f"   Vector: {embeddings[0][:10]}")
print(f"   Norm (magnitude): {np.linalg.norm(embeddings[0]):.3f}")

In [None]:
# Visualize embedding distribution
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# 1. Distribution of values across all embeddings
all_values = embeddings.flatten()
ax1.hist(all_values, bins=50, alpha=0.7, color='skyblue', edgecolor='black')
ax1.set_title('Distribution of All Embedding Values')
ax1.set_xlabel('Embedding Value')
ax1.set_ylabel('Frequency')
ax1.axvline(all_values.mean(), color='red', linestyle='--', label=f'Mean: {all_values.mean():.3f}')
ax1.legend()

# 2. Embedding norms (magnitudes)
norms = np.linalg.norm(embeddings, axis=1)
ax2.bar(range(len(norms)), norms, color='lightgreen', alpha=0.7)
ax2.set_title('Embedding Magnitudes')
ax2.set_xlabel('Text Index')
ax2.set_ylabel('L2 Norm')
ax2.axhline(norms.mean(), color='red', linestyle='--', label=f'Mean: {norms.mean():.3f}')
ax2.legend()

# 3. Heatmap of first few embeddings
sns.heatmap(embeddings[:5, :50], annot=False, cmap='RdBu_r', center=0, ax=ax3)
ax3.set_title('First 5 Embeddings (First 50 Dimensions)')
ax3.set_xlabel('Dimension')
ax3.set_ylabel('Text Index')

# 4. Dimension variance
dim_variance = np.var(embeddings, axis=0)
ax4.plot(dim_variance, alpha=0.7, color='purple')
ax4.set_title('Variance per Dimension')
ax4.set_xlabel('Dimension Index')
ax4.set_ylabel('Variance')
ax4.axhline(dim_variance.mean(), color='red', linestyle='--', 
           label=f'Mean Variance: {dim_variance.mean():.6f}')
ax4.legend()

plt.tight_layout()
plt.show()

print("\n📊 Key Observations:")
print(f"   • Values roughly follow normal distribution around 0")
print(f"   • Similar magnitude across different texts ({norms.std():.3f} std dev)")
print(f"   • All dimensions contribute (no completely zero dimensions)")
print(f"   • Dense representation (very few zero values)")

## 📏 Exercise 2: Understanding Similarity Metrics

Let's explore different ways to measure similarity between embeddings.

In [None]:
def calculate_all_similarities(embeddings):
    """
    Calculate similarity using different metrics
    """
    # Cosine similarity (most common for text)
    cosine_sim = cosine_similarity(embeddings)
    
    # Dot product (for normalized vectors, same as cosine)
    dot_product_sim = np.dot(embeddings, embeddings.T)
    
    # Euclidean distance (convert to similarity)
    euclidean_dist = euclidean_distances(embeddings)
    euclidean_sim = 1 / (1 + euclidean_dist)  # Convert distance to similarity
    
    # Manhattan distance (L1 norm)
    manhattan_dist = np.sum(np.abs(embeddings[:, np.newaxis] - embeddings), axis=2)
    manhattan_sim = 1 / (1 + manhattan_dist)
    
    return {
        'cosine': cosine_sim,
        'dot_product': dot_product_sim,
        'euclidean': euclidean_sim,
        'manhattan': manhattan_sim
    }

# Calculate similarities
print("📏 SIMILARITY METRICS COMPARISON")
print("=" * 40)

similarities = calculate_all_similarities(embeddings)

# Compare specific text pairs
interesting_pairs = [
    (0, 1, "ML algorithms vs Deep learning (similar meaning)"),
    (2, 3, "Cat on windowsill vs Feline near window (similar meaning)"),
    (4, 5, "Financial volatility vs Stock fluctuation (similar meaning)"),
    (6, 7, "Python programming vs Snake in grass (different meaning)"),
    (0, 2, "ML algorithms vs Cat on windowsill (unrelated)"),
    (8, 9, "Climate change vs Neural networks (unrelated)")
]

print("\n🔍 Similarity Comparison for Key Pairs:")
# Column width 60 fits the longest description so the table stays aligned
print(f"{'Pair':<60} {'Cosine':<8} {'Dot Prod':<9} {'Euclidean':<10} {'Manhattan':<10}")
print("-" * 101)

for i, j, description in interesting_pairs:
    cosine = similarities['cosine'][i][j]
    dot_prod = similarities['dot_product'][i][j]
    euclidean = similarities['euclidean'][i][j]
    manhattan = similarities['manhattan'][i][j]
    
    print(f"{description:<60} {cosine:<8.3f} {dot_prod:<9.3f} {euclidean:<10.3f} {manhattan:<10.3f}")

In [None]:
# Visualize similarity matrices
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

metrics = ['cosine', 'dot_product', 'euclidean', 'manhattan']
titles = ['Cosine Similarity', 'Dot Product', 'Euclidean Similarity', 'Manhattan Similarity']
colormaps = ['Blues', 'Reds', 'Greens', 'Purples']

for idx, (metric, title, cmap) in enumerate(zip(metrics, titles, colormaps)):
    sim_matrix = similarities[metric]
    
    # Create heatmap
    sns.heatmap(sim_matrix, 
                annot=True, 
                fmt='.2f', 
                cmap=cmap, 
                ax=axes[idx],
                square=True,
                cbar_kws={'label': 'Similarity'})
    
    axes[idx].set_title(f'{title}\n(Range: {sim_matrix.min():.2f} to {sim_matrix.max():.2f})')
    axes[idx].set_xlabel('Text Index')
    axes[idx].set_ylabel('Text Index')

plt.tight_layout()
plt.show()

# Analyze correlations between metrics
print("\n📊 Correlation Between Similarity Metrics:")
correlation_data = {}
for metric_name, sim_matrix in similarities.items():
    # Get upper triangle (excluding diagonal)
    mask = np.triu(np.ones_like(sim_matrix, dtype=bool), k=1)
    correlation_data[metric_name] = sim_matrix[mask]

corr_df = pd.DataFrame(correlation_data)
correlation_matrix = corr_df.corr()
print(correlation_matrix)

print("\n💡 Key Insights:")
print("   • Cosine similarity focuses on direction, not magnitude")
print("   • Dot product includes magnitude information")
print("   • Euclidean considers absolute distance in space")
print("   • For normalized embeddings, cosine ≈ dot product")

## 🎨 Exercise 3: Visualizing Embedding Spaces

Let's make the high-dimensional embeddings visible using dimensionality reduction.
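
To show what a linear reduction does under the hood, here is a minimal sketch of PCA built from NumPy's SVD. The random matrix is a placeholder for real embeddings, and the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 768))   # placeholder for real embeddings

# 1. Center each dimension
X_centered = X - X.mean(axis=0)

# 2. SVD: rows of Vt are the principal directions, ordered by singular value
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project onto the first two directions
X_2d = X_centered @ Vt[:2].T

# Fraction of the total variance captured by each component
explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
print(X_2d.shape)
print(explained_variance_ratio[:2].sum())
```

A library implementation may flip the sign of individual components, so a plot from this sketch can look mirrored compared with another PCA's output while encoding the same structure.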

In [None]:
# Prepare data for visualization
print("🎨 EMBEDDING SPACE VISUALIZATION")
print("=" * 35)

# Create categories for coloring
categories = [
    'ML/AI', 'ML/AI',           # 0, 1: Machine learning texts
    'Animals', 'Animals',       # 2, 3: Cat texts  
    'Finance', 'Finance',       # 4, 5: Financial texts
    'Programming', 'Animals',   # 6, 7: Python programming vs snake
    'Environment', 'ML/AI'      # 8, 9: Climate change vs neural networks
]

colors = ['red', 'red', 'blue', 'blue', 'green', 'green', 'orange', 'blue', 'purple', 'red']

# Method 1: PCA (Principal Component Analysis)
print("\n🔄 Applying PCA...")
pca = PCA(n_components=2, random_state=42)
pca_embeddings = pca.fit_transform(embeddings)

print(f"   Explained variance: {pca.explained_variance_ratio_.sum():.3f}")
print(f"   PC1: {pca.explained_variance_ratio_[0]:.3f}, PC2: {pca.explained_variance_ratio_[1]:.3f}")

# Method 2: UMAP (nonlinear; preserves local neighborhoods well and some global structure)
print("\n🔄 Applying UMAP...")
umap_reducer = umap.UMAP(
    n_components=2,
    metric='cosine',  # Use cosine distance for text embeddings
    n_neighbors=5,    # Small number for small dataset
    min_dist=0.1,
    random_state=42
)
umap_embeddings = umap_reducer.fit_transform(embeddings)

print("   ✅ UMAP transformation complete")

In [None]:
# Create interactive visualization
def create_interactive_embedding_plot(embeddings_2d, method_name, texts, categories):
    """
    Create an interactive plot of 2D embeddings
    """
    fig = go.Figure()
    
    # Group by category for legend
    unique_categories = sorted(set(categories))  # Sorted for a stable legend order
    category_colors = {
        'ML/AI': 'red',
        'Animals': 'blue', 
        'Finance': 'green',
        'Programming': 'orange',
        'Environment': 'purple'
    }
    
    for category in unique_categories:
        # Get indices for this category
        indices = [i for i, cat in enumerate(categories) if cat == category]
        
        if indices:
            fig.add_trace(go.Scatter(
                x=embeddings_2d[indices, 0],
                y=embeddings_2d[indices, 1],
                mode='markers+text',
                name=category,
                text=[str(i) for i in indices],
                textposition='top center',
                hovertext=[f"{i}: {texts[i][:50]}..." for i in indices],
                hoverinfo='text',
                marker=dict(
                    size=12,
                    color=category_colors.get(category, 'gray'),
                    opacity=0.8,
                    line=dict(color='black', width=1)
                )
            ))
    
    fig.update_layout(
        title=f'{method_name} Embedding Visualization<br><sub>Hover for a text preview; numbers are text indices</sub>',
        xaxis_title=f'{method_name} Dimension 1',
        yaxis_title=f'{method_name} Dimension 2',
        width=800,
        height=600,
        showlegend=True
    )
    
    return fig

# Create PCA visualization
pca_fig = create_interactive_embedding_plot(pca_embeddings, 'PCA', sample_texts, categories)
pca_fig.show()

# Create UMAP visualization  
umap_fig = create_interactive_embedding_plot(umap_embeddings, 'UMAP', sample_texts, categories)
umap_fig.show()

print("\n🎯 Visualization Insights:")
print("   • Similar texts cluster together in 2D space")
print("   • UMAP often preserves local neighborhoods better than PCA")
print("   • Nearby points in 2D are usually semantically similar; large 2D distances are less reliable")
print("   • Outliers may represent unique or ambiguous content")

In [None]:
# 3D visualization for even richer exploration
print("\n🌐 Creating 3D UMAP Visualization...")

# Generate 3D UMAP
umap_3d = umap.UMAP(
    n_components=3,
    metric='cosine',
    n_neighbors=5,
    min_dist=0.1,
    random_state=42
).fit_transform(embeddings)

# Create 3D scatter plot
fig_3d = go.Figure()

unique_categories = sorted(set(categories))  # Sorted for a stable legend order
category_colors = {
    'ML/AI': 'red',
    'Animals': 'blue', 
    'Finance': 'green',
    'Programming': 'orange',
    'Environment': 'purple'
}

for category in unique_categories:
    indices = [i for i, cat in enumerate(categories) if cat == category]
    
    if indices:
        fig_3d.add_trace(go.Scatter3d(
            x=umap_3d[indices, 0],
            y=umap_3d[indices, 1], 
            z=umap_3d[indices, 2],
            mode='markers+text',
            name=category,
            text=[str(i) for i in indices],
            hovertext=[f"{i}: {sample_texts[i][:60]}..." for i in indices],
            hoverinfo='text',
            marker=dict(
                size=8,
                color=category_colors.get(category, 'gray'),
                opacity=0.8,
                line=dict(color='black', width=1)
            )
        ))

fig_3d.update_layout(
    title='3D UMAP Embedding Visualization<br><sub>Rotate and zoom to explore the space</sub>',
    scene=dict(
        xaxis_title='UMAP Dimension 1',
        yaxis_title='UMAP Dimension 2',
        zaxis_title='UMAP Dimension 3'
    ),
    width=900,
    height=700
)

fig_3d.show()

print("✅ 3D visualization created! Rotate and zoom to explore the embedding space.")

## 🔍 Exercise 4: Building Semantic Search

Now let's build a practical semantic search system using our embeddings.

In [None]:
class SemanticSearch:
    def __init__(self, documents, model):
        """
        Initialize semantic search with documents and embedding model
        """
        self.documents = documents
        self.model = model
        self.embeddings = None
        self.index_documents()
    
    def index_documents(self):
        """
        Create embeddings for all documents
        """
        print(f"🔄 Indexing {len(self.documents)} documents...")
        start_time = time.time()
        
        self.embeddings = self.model.encode(self.documents, convert_to_tensor=True)
        
        end_time = time.time()
        print(f"   ✅ Indexed in {end_time - start_time:.3f} seconds")
        print(f"   📐 Embedding shape: {self.embeddings.shape}")
    
    def search(self, query, top_k=5, return_scores=True):
        """
        Search for most similar documents to query
        """
        # Encode query
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        
        # Cosine similarity between the query and every document: shape (num_documents,)
        similarities = cos_sim(query_embedding, self.embeddings)[0]
        
        # Get top-k scores and their indices, highest score first
        top_scores, top_indices = torch.topk(similarities, k=min(top_k, len(self.documents)))
        
        results = []
        for score, idx in zip(top_scores, top_indices):
            doc_idx = idx.item()
            
            result = {
                'document': self.documents[doc_idx],
                'index': doc_idx
            }
            # Include the similarity score unless the caller opted out
            if return_scores:
                result['score'] = score.item()
            results.append(result)
        
        return results
    
    def explain_search(self, query, top_k=3):
        """
        Search with detailed explanation
        """
        print(f"🔍 Searching for: '{query}'")
        print("=" * 60)
        
        results = self.search(query, top_k)
        
        for i, result in enumerate(results, 1):
            print(f"\n{i}. Score: {result['score']:.4f}")
            print(f"   Document: {result['document']}")
            print(f"   Index: {result['index']}")
        
        return results

# Create semantic search system
print("🔍 BUILDING SEMANTIC SEARCH SYSTEM")
print("=" * 40)

search_engine = SemanticSearch(sample_texts, model)

In [None]:
# Test semantic search with various queries
test_queries = [
    "artificial intelligence and data",
    "animal resting indoors", 
    "market volatility and trading",
    "programming language for development",
    "environmental issues and global warming"
]

print("🧪 SEMANTIC SEARCH EXPERIMENTS")
print("=" * 35)

for query in test_queries:
    results = search_engine.explain_search(query, top_k=3)
    print("\n" + "=" * 60 + "\n")

In [None]:
# Interactive search widget (if ipywidgets is available)
try:
    from html import escape
    from ipywidgets import interact, widgets
    from IPython.display import display, HTML
    
    def interactive_search(query="machine learning", top_k=3):
        if query.strip():
            results = search_engine.search(query, top_k=top_k)
            
            # Escape the query and documents so characters like < or & render literally
            html_output = f"<h3>🔍 Results for: '{escape(query)}'</h3>"
            
            for i, result in enumerate(results, 1):
                score_color = "green" if result['score'] > 0.7 else "orange" if result['score'] > 0.5 else "red"
                
                html_output += f"""
                <div style="border: 1px solid #ddd; margin: 10px 0; padding: 10px; border-radius: 5px;">
                    <h4>{i}. <span style="color: {score_color};">Score: {result['score']:.4f}</span></h4>
                    <p><strong>Document:</strong> {escape(result['document'])}</p>
                    <p><small><strong>Index:</strong> {result['index']}</small></p>
                </div>
                """
            
            display(HTML(html_output))
        else:
            display(HTML("<p>Please enter a search query.</p>"))
    
    # Create interactive widget
    print("\n🎮 Interactive Semantic Search:")
    interact(
        interactive_search,
        query=widgets.Text(
            value="machine learning",
            placeholder="Enter your search query...",
            description="Query:",
            style={'description_width': 'initial'}
        ),
        top_k=widgets.IntSlider(
            value=3,
            min=1,
            max=len(sample_texts),
            description="Top K results:",
            style={'description_width': 'initial'}
        )
    )
    
except ImportError:
    print("\n⚠️  Interactive widgets not available. You can still use search_engine.explain_search(query) directly.")
    
    # Alternative: simple function-based search
    def try_search(query):
        return search_engine.explain_search(query, top_k=3)
    
    print("\n💡 Try: try_search('your query here')")

## ⚡ Exercise 5: Efficient Batch Processing

Let's explore how to handle larger datasets efficiently.
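
Before benchmarking a real model, it may help to see the batching pattern on its own. The sketch below splits a list into batches and writes each batch's vectors into a preallocated array. `iter_batches`, `encode_in_batches`, and the stand-in `fake_embed` encoder are hypothetical names used only for this illustration.

```python
import numpy as np

def iter_batches(items, batch_size):
    """Yield (start_index, slice) pairs with at most `batch_size` items per slice."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def encode_in_batches(texts, embed_fn, dim, batch_size=64):
    """Fill a preallocated (len(texts), dim) float32 array one batch at a time."""
    out = np.empty((len(texts), dim), dtype=np.float32)
    for start, batch in iter_batches(texts, batch_size):
        out[start:start + len(batch)] = embed_fn(batch)
    return out

def fake_embed(texts):
    """Stand-in encoder (assumption): two hand-made features per text."""
    return np.array([[len(t), t.count("a")] for t in texts], dtype=np.float32)

texts = [f"sample text number {i}" for i in range(10)]
vectors = encode_in_batches(texts, fake_embed, dim=2, batch_size=4)
print(vectors.shape)  # (10, 2)
```

Ten texts with a batch size of 4 are processed as batches of 4, 4, and 2. Preallocating the output avoids holding both a list of chunks and a concatenated copy in memory at the same time.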

In [None]:
# Generate a larger dataset for performance testing
import random

random.seed(42)  # Make the synthetic dataset reproducible across runs

# Base templates for generating varied content
templates = {
    'tech': [
        "The {technology} framework provides {feature} for developers",
        "{language} programming language supports {capability} development",
        "Implementing {concept} requires understanding of {skill}",
        "Modern {tool} enables efficient {process} management"
    ],
    'business': [
        "The company's {metric} increased by {percent}% this quarter",
        "{department} team achieved {goal} through {strategy}",
        "Market {condition} led to {outcome} in sales performance",
        "Strategic {initiative} resulted in improved {result}"
    ],
    'science': [
        "Research shows that {phenomenon} affects {system} significantly",
        "The {study} demonstrates {finding} in {field} research",
        "Scientists discovered {discovery} using {method} analysis",
        "New {technology} enables better {measurement} of {variable}"
    ]
}

# Vocabulary for template filling
vocabulary = {
    'technology': ['React', 'Django', 'TensorFlow', 'Kubernetes', 'Docker'],
    'feature': ['scalability', 'security', 'performance', 'flexibility', 'reliability'],
    'language': ['Python', 'JavaScript', 'Go', 'Rust', 'TypeScript'],
    'capability': ['web', 'mobile', 'data science', 'machine learning', 'backend'],
    'concept': ['microservices', 'APIs', 'databases', 'caching', 'authentication'],
    'skill': ['algorithms', 'design patterns', 'system architecture', 'testing', 'debugging'],
    'tool': ['containers', 'orchestrators', 'CI/CD pipelines', 'monitoring systems', 'load balancers'],
    'process': ['deployment', 'scaling', 'monitoring', 'backup', 'recovery'],
    'metric': ['revenue', 'profit', 'efficiency', 'productivity', 'satisfaction'],
    'percent': ['15', '23', '8', '31', '12'],
    'department': ['Sales', 'Marketing', 'Engineering', 'Support', 'Operations'],
    'goal': ['targets', 'objectives', 'milestones', 'KPIs', 'deliverables'],
    'strategy': ['optimization', 'automation', 'collaboration', 'innovation', 'training'],
    'condition': ['growth', 'volatility', 'stability', 'expansion', 'consolidation'],
    'outcome': ['improvements', 'increases', 'gains', 'successes', 'achievements'],
    'initiative': ['partnerships', 'investments', 'transformations', 'reorganizations', 'expansions'],
    'result': ['efficiency', 'quality', 'speed', 'accuracy', 'customer satisfaction'],
    'phenomenon': ['climate change', 'genetic variation', 'neural activity', 'chemical reactions', 'electromagnetic fields'],
    'system': ['ecosystems', 'organisms', 'networks', 'processes', 'structures'],
    'study': ['experiment', 'analysis', 'investigation', 'survey', 'trial'],
    'finding': ['correlations', 'patterns', 'relationships', 'mechanisms', 'effects'],
    'field': ['biology', 'chemistry', 'physics', 'psychology', 'neuroscience'],
    'discovery': ['compounds', 'proteins', 'mechanisms', 'pathways', 'interactions'],
    'method': ['statistical', 'computational', 'experimental', 'observational', 'theoretical'],
    'measurement': ['monitoring', 'tracking', 'assessment', 'evaluation', 'quantification'],
    'variable': ['temperature', 'pressure', 'concentration', 'activity', 'response']
}

def generate_text_dataset(num_texts=1000):
    """
    Generate a synthetic dataset for performance testing
    """
    texts = []
    
    for _ in range(num_texts):
        # Choose random category and template
        category = random.choice(list(templates.keys()))
        template = random.choice(templates[category])
        
        # Fill template with random vocabulary
        text = template
        for placeholder in vocabulary:
            if f'{{{placeholder}}}' in text:
                replacement = random.choice(vocabulary[placeholder])
                text = text.replace(f'{{{placeholder}}}', replacement)
        
        texts.append(text)
    
    return texts

# Generate test dataset
print("📊 GENERATING LARGE DATASET FOR PERFORMANCE TESTING")
print("=" * 55)

large_dataset = generate_text_dataset(1000)
print(f"✅ Generated {len(large_dataset)} synthetic texts")
print(f"\n📝 Sample texts:")
for i in range(5):
    print(f"   {i+1}. {large_dataset[i]}")

In [None]:
# Compare different batch processing approaches
def benchmark_embedding_generation(texts, model, batch_sizes=(1, 16, 64, 128), n_texts=100):
    """
    Benchmark embedding generation with different batch sizes
    """
    results = {}
    subset = texts[:n_texts]  # Limit for timing
    
    # Warm-up call so one-time costs (e.g. CUDA initialization) don't penalize the first run
    model.encode(subset[:8])
    
    # Test different batch sizes
    for batch_size in batch_sizes:
        print(f"\n🔄 Testing batch size: {batch_size}")
        
        start_time = time.perf_counter()
        
        if batch_size == 1:
            # One at a time: a separate encode() call per text
            embeddings = np.array([model.encode([text])[0] for text in subset])
        else:
            # Batch processing
            embeddings = model.encode(subset, batch_size=batch_size)
        
        processing_time = time.perf_counter() - start_time
        
        results[batch_size] = {
            'time': processing_time,
            'texts_per_second': len(subset) / processing_time,
            'embedding_shape': embeddings.shape
        }
        
        print(f"   Time: {processing_time:.3f}s")
        print(f"   Speed: {results[batch_size]['texts_per_second']:.1f} texts/second")
    
    return results

# Run benchmark
print("⚡ BATCH PROCESSING PERFORMANCE BENCHMARK")
print("=" * 45)

batch_results = benchmark_embedding_generation(large_dataset, model)

# Visualize results
batch_sizes = list(batch_results.keys())
processing_times = [batch_results[bs]['time'] for bs in batch_sizes]
speeds = [batch_results[bs]['texts_per_second'] for bs in batch_sizes]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Processing time vs batch size
ax1.plot(batch_sizes, processing_times, 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Batch Size')
ax1.set_ylabel('Processing Time (seconds)')
ax1.set_title('Processing Time vs Batch Size\n(100 texts)')
ax1.grid(True, alpha=0.3)
ax1.set_xscale('log')

# Throughput vs batch size
ax2.plot(batch_sizes, speeds, 'ro-', linewidth=2, markersize=8)
ax2.set_xlabel('Batch Size')
ax2.set_ylabel('Texts per Second')
ax2.set_title('Throughput vs Batch Size')
ax2.grid(True, alpha=0.3)
ax2.set_xscale('log')

plt.tight_layout()
plt.show()

# Find optimal batch size
optimal_batch_size = max(batch_results.keys(), key=lambda bs: batch_results[bs]['texts_per_second'])
optimal_speed = batch_results[optimal_batch_size]['texts_per_second']

print(f"\n🏆 Optimal batch size: {optimal_batch_size}")
print(f"   Speed: {optimal_speed:.1f} texts/second")
print(f"   Speedup vs batch=1: {optimal_speed / batch_results[1]['texts_per_second']:.1f}x")

In [None]:
# Memory-efficient processing for very large datasets
def memory_efficient_embedding(texts, model, batch_size=64, max_memory_mb=500):
    """
    Process embeddings in chunks to manage memory usage
    """
    print(f"💾 Processing {len(texts)} texts with memory management...")
    
    # Estimate memory for the output vectors only (model activations are not counted)
    embedding_dim = model.get_sentence_embedding_dimension()
    bytes_per_embedding = embedding_dim * 4  # float32
    embeddings_per_mb = (1024 * 1024) / bytes_per_embedding
    
    max_embeddings = int(max_memory_mb * embeddings_per_mb)
    effective_batch_size = max(1, min(batch_size, max_embeddings))
    
    print(f"   Embedding dimensions: {embedding_dim}")
    print(f"   Memory per embedding: {bytes_per_embedding} bytes")
    print(f"   Effective batch size: {effective_batch_size}")
    print(f"   Estimated memory usage: {effective_batch_size * bytes_per_embedding / 1024 / 1024:.1f} MB per batch")
    
    # Preallocate the result once instead of collecting chunks and copying them with np.vstack
    final_embeddings = np.empty((len(texts), embedding_dim), dtype=np.float32)
    
    for i in range(0, len(texts), effective_batch_size):
        batch_texts = texts[i:i + effective_batch_size]
        # Pass batch_size explicitly; otherwise encode() uses its own default batch size
        final_embeddings[i:i + len(batch_texts)] = model.encode(batch_texts, batch_size=effective_batch_size)
        
        if (i // effective_batch_size + 1) % 10 == 0:
            print(f"   Processed {i + len(batch_texts)} / {len(texts)} texts...")
    
    print(f"   ✅ Complete! Final shape: {final_embeddings.shape}")
    return final_embeddings

# Test memory-efficient processing
print("\n💾 MEMORY-EFFICIENT PROCESSING TEST")
print("=" * 40)

# Process first 500 texts efficiently
efficient_embeddings = memory_efficient_embedding(
    large_dataset[:500], 
    model, 
    batch_size=optimal_batch_size,
    max_memory_mb=100  # Limit to 100MB
)

print(f"\n📊 Results:")
print(f"   Processed {efficient_embeddings.shape[0]} embeddings")
print(f"   Dimensions: {efficient_embeddings.shape[1]}")
print(f"   Memory usage: {efficient_embeddings.nbytes / 1024 / 1024:.1f} MB")

## 🧠 Key Takeaways

From this module, you should now understand:

### 🔢 Embedding Fundamentals:
1. **Dense vectors**: Embeddings are dense, high-dimensional vectors (not sparse like TF-IDF)
2. **Semantic coordinates**: Each dimension captures aspects of meaning
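3. **Small values centered near 0**: Components are roughly bell-shaped around 0; for unit-length vectors each lies in [-1, 1], and most are far smaller in magnitude
4. **Consistent magnitude**: Norms that are nearly identical across texts (about 1.0) indicate the model L2-normalizes its output; not every model does, so check before treating dot product as cosine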
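
The point about magnitude can be checked, and enforced, with a few lines of NumPy. The random matrix below is a placeholder for un-normalized embeddings; `raw`, `row_norms`, and `unit` are names chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(size=(5, 768))                  # placeholder for un-normalized embeddings

row_norms = np.linalg.norm(raw, axis=1, keepdims=True)
unit = raw / np.clip(row_norms, 1e-12, None)     # guard against division by zero

print(np.linalg.norm(unit, axis=1))   # every row now has norm 1 (up to rounding)
print(np.abs(unit).max())             # far below 1: the mass is spread over 768 dimensions
```

With sentence-transformers, passing `normalize_embeddings=True` to `model.encode` should give the same unit-length result without the manual step.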

### 📏 Similarity Metrics:
- **Cosine similarity**: Best for text, measures angle (direction) not magnitude
- **Dot product**: Equivalent to cosine for normalized vectors, computationally faster
- **Euclidean distance**: Considers absolute distance, sensitive to magnitude
- **For RAG systems**: Cosine similarity is the most common default; match the metric your embedding model was trained with
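
For unit-length vectors these metrics are tied together by the identity ‖a − b‖² = 2 − 2·cos(a, b), so cosine and Euclidean distance rank neighbors identically. The sketch below checks this numerically on random unit vectors, which stand in for normalized embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(6, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # scale rows to unit length

query, docs = vectors[0], vectors[1:]

cosine = docs @ query                              # equals cosine because all norms are 1
euclidean = np.linalg.norm(docs - query, axis=1)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(euclidean ** 2, 2 - 2 * cosine))

# Consequently both metrics produce the same ranking of documents
print(np.array_equal(np.argsort(-cosine), np.argsort(euclidean)))
```

If the vectors are not normalized, the identity no longer holds and the two metrics can disagree about which document is closest.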

### 🎨 Visualization Insights:
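1. **UMAP vs. PCA and t-SNE**: UMAP captures nonlinear neighborhood structure that PCA misses and usually retains more global structure than t-SNE
2. **Semantic clustering**: Similar texts naturally cluster together
3. **3D exploration**: Interactive 3D plots can reveal structure that a 2D projection hides
4. **Distance ≈ similarity**: Nearby points in a projection are usually semantically similar, but distances between far-apart clusters are not reliable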
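
How far a projection can be trusted is measurable. One simple check, sketched here with hypothetical helper names, is the fraction of each point's nearest neighbors in the original space that remain nearest neighbors after projection. Random data and a crude "keep two coordinates" projection stand in for real embeddings and a real reducer.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors (Euclidean) of every row, excluding the row itself."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    return np.argsort(dists, axis=1)[:, :k]

def neighbor_overlap(high_dim, low_dim, k=3):
    """Mean fraction of each point's k high-dimensional neighbors kept in the projection."""
    high_nn = knn_indices(high_dim, k)
    low_nn = knn_indices(low_dim, k)
    overlaps = [len(set(h) & set(l)) / k for h, l in zip(high_nn, low_nn)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
high = rng.normal(size=(30, 50))   # placeholder for real embeddings
low = high[:, :2]                  # crude "projection": keep only two coordinates

print(neighbor_overlap(high, high))   # 1.0 by construction
print(neighbor_overlap(high, low))    # expected to be well below 1.0 for this crude projection
```

Calling `neighbor_overlap(embeddings, umap_embeddings)` on this notebook's arrays would give a rough score for the UMAP plot. The pairwise-difference array grows with the square of the number of points, so this version suits small datasets only.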

### 🔍 Practical Search:
- **Simple implementation**: Just compute cosine similarity between query and document embeddings
- **Top-k retrieval**: Sort by similarity and return best matches
- **Real-time capable**: Modern embeddings are fast enough for interactive search
- **No keyword matching**: Semantic search finds conceptually similar content, not just word matches
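
The retrieval step described above reduces to a matrix-vector product followed by a top-k selection. This is a minimal NumPy sketch with invented 2-D "document embeddings"; `top_k_cosine` is a hypothetical helper, not a library function.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    # Normalize so that a plain dot product equals cosine similarity
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = docs @ query

    k = min(k, len(scores))
    # argpartition selects the k best without a full sort; only those k are then sorted
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]

# Toy 2-D "document embeddings" (invented for illustration)
doc_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
indices, scores = top_k_cosine(np.array([1.0, 0.1]), doc_matrix, k=2)
print(indices)   # [0 2]
print(scores)
```

In production the document matrix would be normalized once at indexing time rather than on every query. A vector database replaces the exhaustive product with an approximate index.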

### ⚡ Performance Optimization:
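1. **Batch processing**: Often several times faster than encoding texts one at a time, especially on a GPU; run the benchmark above on your own hardware
2. **Optimal batch size**: Often in the 32-128 range on a GPU, limited by available memory and text length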
3. **Memory management**: Process in chunks for large datasets
4. **PyTorch tensors**: Use GPU acceleration when available

### 🎯 Production Guidelines:
- **Pre-compute embeddings**: Generate document embeddings offline
- **Cache results**: Store embeddings to avoid recomputation
- **Monitor memory**: Large embedding datasets can consume significant RAM
- **Consider quantization**: Reduce precision for memory savings if needed
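
To make the quantization guideline concrete, the sketch below compares float32 storage with float16 and with a simple symmetric int8 scheme that uses one global scale. This is one basic approach among several; the data is random and all variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
emb = rng.normal(size=(1000, 768)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)     # unit-length rows

# Option 1: half precision (2 bytes per value)
emb_fp16 = emb.astype(np.float16)

# Option 2: symmetric int8 scalar quantization with one global scale (1 byte per value)
scale = np.abs(emb).max() / 127.0
emb_int8 = np.round(emb / scale).astype(np.int8)
emb_restored = emb_int8.astype(np.float32) * scale

# Sizes in KB for float32, float16, int8
print(emb.nbytes // 1024, emb_fp16.nbytes // 1024, emb_int8.nbytes // 1024)  # 3000 1500 750

# How much does dot-product similarity to the first row change after the round trip?
exact = emb @ emb[0]
approx = emb_restored @ emb_restored[0]
print(np.abs(exact - approx).max())
```

The int8 copy takes a quarter of the memory at the cost of a small similarity error. Whether that error is acceptable depends on the retrieval task, so it is worth measuring on real queries before committing. Pre-computed arrays in any of these formats can be stored with `np.save` and reloaded with `np.load`.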

## 🎯 Next Steps

In **Module 6**, we'll store these embeddings efficiently:
- Vector database options (Chroma, Pinecone, Weaviate, Qdrant)
- Indexing strategies for fast similarity search
- CRUD operations with metadata
- Performance optimization and scaling
- Production deployment considerations

You now have the foundational skills to work with embeddings in RAG systems!

## 🤔 Discussion Questions

1. When might you choose Euclidean distance over cosine similarity for your embeddings?
2. How would you handle embedding drift when your document corpus changes over time?
3. What strategies would you use to debug poor semantic search results?
4. How would you implement incremental embedding updates for a production system?

## 📝 Optional Exercise

**Advanced Challenge**: Build a "semantic explorer" that:
1. Takes a large collection of your domain-specific texts
2. Generates embeddings and creates an interactive 3D visualization
3. Allows clicking on points to see the text and find similar documents
4. Includes search functionality with result highlighting in the 3D space
5. Shows clustering of different topics/themes in your domain

This will give you hands-on experience with the practical aspects of working with embeddings at scale!