# 📊 Embeddings Module - Complete Tutorial

This notebook provides a comprehensive guide to understanding and using embeddings in AI/ML applications.

## Table of Contents
1. [What are Embeddings?](#1-what-are-embeddings)
2. [Text Embeddings](#2-text-embeddings)
3. [Image Embeddings](#3-image-embeddings)
4. [Multi-Modal Embeddings](#4-multi-modal-embeddings)
5. [Embedding Caching](#5-embedding-caching)
6. [Similarity Search](#6-similarity-search)
7. [Practical Applications](#7-practical-applications)
8. [Best Practices](#8-best-practices)

---

## 1. What are Embeddings?

### 1.1 Definition

**Embeddings** are dense vector representations of data (text, images, audio, etc.) in a continuous vector space. They capture semantic meaning and relationships between data points.

### 1.2 Key Properties

| Property | Description |
|----------|-------------|
| **Dense** | Most components are non-zero, unlike sparse representations (one-hot, TF-IDF) |
| **Fixed Dimension** | Every embedding from a given model has the same dimensionality (e.g., 384, 768) |
| **Semantic** | Similar concepts have similar embeddings (close in vector space) |
| **Learnable** | Embeddings are learned from data, not hand-crafted |

### 1.3 Why Embeddings?

```
Traditional:  "cat" → [0, 0, 0, ..., 1, 0, 0]  (10,000+ dimensions, sparse)
Embeddings:   "cat" → [0.21, -0.45, 0.89, ...]  (384 dimensions, dense)
```
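
To make the contrast concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The three-word vocabulary and the dense values are hand-picked for illustration (they are assumptions, not output from any model); the point is only that one-hot vectors are mutually orthogonal, while dense vectors can place related words close together.

```python
import numpy as np

vocab = ["cat", "dog", "car"]

# One-hot: one dimension per vocabulary word, a single 1 per vector.
one_hot = np.eye(len(vocab))

# Hand-picked dense vectors (illustrative values, not model output).
dense = np.array([
    [0.9, 0.8, 0.1],  # cat
    [0.8, 0.9, 0.2],  # dog
    [0.1, 0.2, 0.9],  # car
])


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


one_hot_cat_dog = cosine(one_hot[0], one_hot[1])
dense_cat_dog = cosine(dense[0], dense[1])
dense_cat_car = cosine(dense[0], dense[2])

print(f"one-hot cat vs dog: {one_hot_cat_dog:.2f}")
print(f"dense   cat vs dog: {dense_cat_dog:.2f}")
print(f"dense   cat vs car: {dense_cat_car:.2f}")
```

With one-hot vectors every pair of distinct words looks equally unrelated; the dense vectors can express that "cat" is closer to "dog" than to "car".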

**Benefits:**
- Capture semantic relationships ("king" - "man" + "woman" ≈ "queen")
- Enable efficient similarity search
- Work across languages and modalities when the model is trained for it (multilingual or multi-modal models)
- Serve as features for downstream ML models

In [None]:
# Setup
import sys
sys.path.insert(0, '../..')  # Add project root to path

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Import our embeddings module
from src.embeddings import (
    TextEmbedder,
    ImageEmbedder,
    MultiModalEmbedder,
    EmbeddingCache,
    EmbeddingConfig
)

print("✅ Setup complete!")

---

## 2. Text Embeddings

### 2.1 Understanding Text Embeddings

Text embeddings transform natural language into numerical vectors that capture semantic meaning.

**Popular Models:**

| Model | Dimensions | Speed | Quality | Use Case |
|-------|------------|-------|---------|----------|
| all-MiniLM-L6-v2 | 384 | ⚡ Fast | Good | General purpose |
| all-mpnet-base-v2 | 768 | Medium | Best | High quality |
| OpenAI text-embedding-3-small | 1536 | API | Excellent | Production |

### 2.2 Basic Usage

In [None]:
# Initialize the text embedder
text_embedder = TextEmbedder(model_name="all-MiniLM-L6-v2")

# Encode a single sentence
sentence = "Machine learning is transforming the world of technology."
embedding = text_embedder.encode(sentence)

print(f"Input: '{sentence}'")
print(f"Embedding shape: {embedding.shape}")
print(f"First 10 values: {embedding[0][:10]}")
print(f"Embedding dimension: {text_embedder.embedding_dim}")

In [None]:
# Encode multiple sentences
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast auburn fox leaps above a sleepy canine.",
    "The weather is beautiful today in New York.",
    "It's sunny and warm in Manhattan this afternoon.",
    "Python is a popular programming language."
]

embeddings = text_embedder.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
print(f"\nNumber of sentences: {len(sentences)}")
print(f"Embedding dimension: {embeddings.shape[1]}")

### 2.3 Computing Similarity

**Cosine Similarity** measures the cosine of the angle between two vectors, so it depends on their direction but not their length:

$$\text{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$

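- **1.0**: Vectors point in the same direction (near-identical meaning)
- **0.0**: Vectors are orthogonal (unrelated)
- **-1.0**: Vectors point in opposite directions (scores this low are uncommon for sentence embeddings)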
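
The formula can be written directly in NumPy. This is a minimal sketch, independent of `TextEmbedder.similarity`; the function name `cosine_similarity` and the toy 2-D vectors are chosen here for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Return a.b / (|a| |b|) for two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


same = cosine_similarity([1, 2], [2, 4])    # same direction, different length
orth = cosine_similarity([1, 0], [0, 1])    # perpendicular
opp = cosine_similarity([1, 2], [-1, -2])   # opposite direction

print(f"same direction: {same:.2f}")
print(f"orthogonal:     {orth:.2f}")
print(f"opposite:       {opp:.2f}")
```

Note that `[1, 2]` and `[2, 4]` score as identical even though their lengths differ: only direction matters.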

In [None]:
# Compute similarity between sentence pairs
pairs = [
    (sentences[0], sentences[1]),  # Similar (fox sentences)
    (sentences[2], sentences[3]),  # Similar (weather sentences)
    (sentences[0], sentences[4]),  # Different topics
]

print("Similarity Scores:")
print("=" * 60)

for s1, s2 in pairs:
    similarity = text_embedder.similarity(s1, s2)
    print(f"\nSentence 1: '{s1[:50]}...'")
    print(f"Sentence 2: '{s2[:50]}...'")
    print(f"Similarity: {similarity:.4f}")

### 2.4 Finding Similar Texts

In [None]:
# Define a knowledge base
knowledge_base = [
    "Python is a high-level programming language known for its readability.",
    "JavaScript is essential for web development and runs in browsers.",
    "Machine learning algorithms learn patterns from data.",
    "Deep learning uses neural networks with multiple layers.",
    "Natural language processing helps computers understand human language.",
    "Computer vision enables machines to interpret visual information.",
    "Reinforcement learning trains agents through rewards and penalties.",
    "Data science combines statistics, programming, and domain expertise."
]

# Search query
query = "How do neural networks learn?"

# Find most similar documents
results = text_embedder.most_similar(query, knowledge_base, top_k=3)

print(f"Query: '{query}'")
print("\nTop 3 Results:")
print("-" * 60)

for i, (text, score) in enumerate(results, 1):
    print(f"{i}. [Score: {score:.4f}] {text}")

### 2.5 Visualizing Embeddings

We use **t-SNE** to reduce high-dimensional embeddings to 2D for visualization.

In [None]:
# Categorized sentences for visualization
categories = {
    "Technology": [
        "Artificial intelligence is revolutionizing industries.",
        "Machine learning models can predict future trends.",
        "Deep neural networks power modern AI systems."
    ],
    "Nature": [
        "The forest is home to many species of birds.",
        "Mountains covered in snow look beautiful at sunrise.",
        "The ocean waves crash against the rocky shore."
    ],
    "Food": [
        "Italian pizza is loved all around the world.",
        "Sushi is a traditional Japanese dish with rice and fish.",
        "Chocolate cake is a popular dessert choice."
    ]
}

# Flatten and embed
all_texts = []
all_labels = []
for category, texts in categories.items():
    all_texts.extend(texts)
    all_labels.extend([category] * len(texts))

embeddings = text_embedder.encode(all_texts)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)

# Plot
plt.figure(figsize=(10, 8))
colors = {'Technology': 'blue', 'Nature': 'green', 'Food': 'red'}

for category in categories:
    mask = np.array(all_labels) == category  # boolean mask selecting this category's points
    plt.scatter(
        embeddings_2d[mask, 0],
        embeddings_2d[mask, 1],
        c=colors[category],
        label=category,
        s=100
    )

plt.title("Text Embeddings Visualization (t-SNE)")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✅ Notice how semantically similar texts cluster together!")

---

## 3. Image Embeddings

### 3.1 Understanding Image Embeddings

Image embeddings convert visual content into dense vectors, enabling:
- Similar image search
- Image classification
- Cross-modal search (find images using text)

**CLIP (Contrastive Language-Image Pre-training)**:
- Trained on 400M image-text pairs
- Learns joint embedding space for images and text
- Enables zero-shot image classification

### 3.2 Basic Usage

In [None]:
# Initialize image embedder
image_embedder = ImageEmbedder(model_name="openai/clip-vit-base-patch32")

print("Image Embedder initialized")
print(f"Embedding dimension: {image_embedder.embedding_dim}")

# Note: To use with actual images:
# embedding = image_embedder.encode("path/to/image.jpg")
# 
# Or with PIL Images:
# from PIL import Image
# img = Image.open("path/to/image.jpg")
# embedding = image_embedder.encode(img)

### 3.3 Image-Text Similarity with CLIP

CLIP projects both images and text into the same embedding space, allowing direct comparison.
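
Because images and captions live in one space, zero-shot classification reduces to "which caption embedding is closest to the image embedding?". The sketch below shows that step with hand-made 3-D vectors standing in for real CLIP outputs. The function name `zero_shot_probs`, the toy vectors, and the temperature value are assumptions for illustration, not part of `ImageEmbedder`.

```python
import numpy as np


def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and several captions."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (text_embs @ image_emb) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Toy stand-ins for caption and image embeddings (not real model output).
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.9, 0.1, 0.05])

probs = zero_shot_probs(image_emb, text_embs)
predicted = labels[int(np.argmax(probs))]
print(predicted)
```

The small temperature sharpens the softmax so that modest differences in cosine similarity turn into a confident prediction.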

In [None]:
# Demonstrate the concept with synthetic example
# In production, you would use actual images

print("CLIP Image-Text Matching Concept:")
print("="*60)
print("""
Given an image of a cat (illustrative scores, not model output):
  - "a photo of a cat"      → High similarity (0.85)
  - "a photo of a dog"      → Medium similarity (0.45)
  - "a photo of a car"      → Low similarity (0.15)
  
This enables:
  ✅ Zero-shot image classification
  ✅ Image search using natural language
  ✅ Content-based image retrieval
""")

---

## 4. Multi-Modal Embeddings

### 4.1 Understanding Multi-Modal Embeddings

Multi-modal embeddings project different types of data (text, images, audio) into a shared vector space.

```
Text: "a cute puppy"     ───┐
                            ├──→ Same Vector Space
Image: [puppy.jpg]       ───┘
```

### 4.2 Usage

In [None]:
# Initialize multi-modal embedder
multimodal_embedder = MultiModalEmbedder()

# Encode text descriptions
text_descriptions = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a car",
    "a photo of a sunset"
]

text_embeddings = multimodal_embedder.encode_text(text_descriptions)

print(f"Text embeddings shape: {text_embeddings.shape}")
print(f"Embedding dimension: {multimodal_embedder.embedding_dim}")

# In production, you would compare with image embeddings:
# image_embedding = multimodal_embedder.encode_image("sunset.jpg")
# similarity = np.dot(text_embeddings[3], image_embedding[0])  # cosine similarity if both are L2-normalized

---

## 5. Embedding Caching

### 5.1 Why Cache Embeddings?

| Without Cache | With Cache |
|--------------|------------|
| Compute every time | Compute once, reuse |
| Slow for repeated queries | Fast repeated lookups |
| Higher API costs | Reduced API costs |
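
The idea behind the right-hand column can be sketched in a few lines. This is a hypothetical least-recently-used cache written for illustration; it is not the implementation of `EmbeddingCache`, and the names `SimpleEmbeddingCache`, `get_or_compute`, and `fake_embed` are assumptions.

```python
import hashlib
from collections import OrderedDict

import numpy as np


class SimpleEmbeddingCache:
    """Tiny LRU cache keyed by a hash of the input text (illustrative only)."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, compute_fn):
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        value = compute_fn(text)
        self._store[key] = value
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value


def fake_embed(text):
    """Stand-in for a real (slow or paid) embedding call."""
    return np.full(4, float(len(text)))


demo_cache = SimpleEmbeddingCache(max_size=2)
demo_cache.get_or_compute("hello", fake_embed)  # miss: computed
demo_cache.get_or_compute("hello", fake_embed)  # hit: reused
demo_cache.get_or_compute("world", fake_embed)  # miss
demo_cache.get_or_compute("again", fake_embed)  # miss, evicts "hello"
print(f"hits={demo_cache.hits}, misses={demo_cache.misses}")
```

Hashing the text keeps keys a fixed size regardless of input length, and the size cap bounds memory use.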

### 5.2 Using EmbeddingCache

In [None]:
import time

# Create cache
cache = EmbeddingCache(max_size=1000, persist=False)

# Create embedder with cache
cached_embedder = TextEmbedder(
    model_name="all-MiniLM-L6-v2",
    cache=cache
)

test_texts = [
    "This is a test sentence for caching.",
    "Another example to demonstrate caching.",
    "Embedding cache improves performance significantly."
]

# First call - Cache miss (computes embeddings)
start = time.time()
_ = cached_embedder.encode(test_texts)
first_call_time = time.time() - start

print(f"First call (cache miss): {first_call_time:.4f}s")
print(f"Cache stats: {cache.stats}")

# Second call - Cache hit (retrieves from cache)
start = time.time()
_ = cached_embedder.encode(test_texts)
second_call_time = time.time() - start

print(f"\nSecond call (cache hit): {second_call_time:.4f}s")
print(f"Cache stats: {cache.stats}")

speedup = first_call_time / second_call_time if second_call_time > 0 else float('inf')
print(f"\n⚡ Speedup: {speedup:.1f}x faster with cache!")

### 5.3 Cache Persistence

Set `persist=True` to save the cache to disk:

In [None]:
# Create persistent cache
persistent_cache = EmbeddingCache(
    max_size=10000,
    persist=True,
    cache_path="./embedding_cache.pkl"
)

print("Persistent cache created!")
print("- Cache is saved to disk after each write")
print("- Cache is loaded automatically on restart")
print("- Useful for production deployments")

---

## 6. Similarity Search

### 6.1 Building a Simple Search Engine

Let's build a document search engine using embeddings.

In [None]:
class EmbeddingSearchEngine:
    """
    Simple search engine using text embeddings.
    
    Example:
        >>> engine = EmbeddingSearchEngine()
        >>> engine.index(["doc1", "doc2", "doc3"])
        >>> results = engine.search("query", top_k=2)
    """
    
    def __init__(self, embedder: TextEmbedder = None):
        self.embedder = embedder or TextEmbedder()
        self.documents = []
        self.embeddings = None
    
    def index(self, documents: list):
        """Index documents by computing their embeddings."""
        self.documents = documents
        self.embeddings = self.embedder.encode(documents)
        print(f"Indexed {len(documents)} documents")
    
    def search(self, query: str, top_k: int = 5) -> list:
        """Search for documents similar to the query."""
        query_embedding = self.embedder.encode(query)[0]
        
        # Dot product; equals cosine similarity when embeddings are L2-normalized
        similarities = np.dot(self.embeddings, query_embedding)
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        return [
            {"document": self.documents[i], "score": float(similarities[i])}
            for i in top_indices
        ]


# Create search engine
search_engine = EmbeddingSearchEngine(text_embedder)

# Index some documents
documents = [
    "Python is a versatile programming language used in AI and web development.",
    "JavaScript enables interactive web pages and is essential for frontend development.",
    "Machine learning algorithms can learn patterns from data without explicit programming.",
    "Deep learning uses neural networks with many layers to learn complex representations.",
    "Natural language processing helps computers understand and generate human language.",
    "Computer vision enables machines to interpret and understand visual information.",
    "Reinforcement learning trains AI agents through trial and error with rewards.",
    "Data science combines statistics, programming, and domain knowledge to extract insights.",
    "Cloud computing provides on-demand access to computing resources over the internet.",
    "Cybersecurity protects computer systems and networks from digital attacks."
]

search_engine.index(documents)

In [None]:
# Perform searches
queries = [
    "How do neural networks work?",
    "Best language for building websites",
    "Protecting computers from hackers"
]

for query in queries:
    print(f"\n🔍 Query: '{query}'")
    print("-" * 60)
    
    results = search_engine.search(query, top_k=3)
    
    for i, result in enumerate(results, 1):
        print(f"  {i}. [{result['score']:.3f}] {result['document'][:60]}...")
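
Sorting every score with `np.argsort` is fine for a handful of documents. For larger collections, one possible refinement is to select the top-k first and sort only those. The helper below is a sketch of that idea using `np.argpartition`; the name `top_k_indices` is chosen here and is not part of the search engine class above.

```python
import numpy as np


def top_k_indices(scores, k):
    """Indices of the k largest scores, ordered from highest to lowest."""
    scores = np.asarray(scores)
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition places the k largest values in the last k slots (unordered).
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort just those k candidates by score, descending.
    return candidates[np.argsort(scores[candidates])[::-1]]


scores = np.array([0.12, 0.87, 0.45, 0.91, 0.33])
top = top_k_indices(scores, 3)
print(top)
```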

---

## 7. Practical Applications

### 7.1 Semantic Duplicate Detection

In [None]:
def find_duplicates(texts: list, threshold: float = 0.85) -> list:
    """
    Find semantically similar (duplicate) texts.
    
    Args:
        texts: List of texts to check
        threshold: Similarity threshold for duplicates
        
    Returns:
        List of duplicate pairs with similarity scores
    """
    embeddings = text_embedder.encode(texts)
    
    duplicates = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            # Dot product; equals cosine similarity when embeddings are L2-normalized
            similarity = np.dot(embeddings[i], embeddings[j])
            if similarity >= threshold:
                duplicates.append({
                    "text1": texts[i],
                    "text2": texts[j],
                    "similarity": float(similarity)
                })
    
    return duplicates


# Test duplicate detection
product_reviews = [
    "This product is amazing! Great quality and fast shipping.",
    "Excellent product, shipped quickly, very high quality!",
    "The weather has been nice this week.",
    "Amazing quality, fast delivery, love this product!",
    "It's been sunny and warm lately."
]

duplicates = find_duplicates(product_reviews, threshold=0.7)

print("Potential Duplicates Found:")
print("=" * 60)

for dup in duplicates:
    print(f"\nSimilarity: {dup['similarity']:.3f}")
    print(f"  Text 1: {dup['text1']}")
    print(f"  Text 2: {dup['text2']}")
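
The nested loop above is easy to read but calls `np.dot` once per pair. A vectorized variant computes the full similarity matrix in one matrix product and reads off the upper triangle. This is a sketch under the assumption that embeddings arrive as a 2-D array; the name `duplicate_pairs` and the toy vectors are illustrative.

```python
import numpy as np


def duplicate_pairs(embeddings, threshold=0.85):
    """Return (i, j, similarity) for every pair i < j at or above the threshold."""
    emb = np.asarray(embeddings, dtype=float)
    # Normalize rows so the matrix product yields cosine similarities.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    # Upper triangle without the diagonal: each unordered pair exactly once.
    rows, cols = np.triu_indices(len(emb), k=1)
    keep = sims[rows, cols] >= threshold
    return [
        (int(i), int(j), float(sims[i, j]))
        for i, j in zip(rows[keep], cols[keep])
    ]


# Toy 2-D vectors standing in for real embeddings.
toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
])
pairs_found = duplicate_pairs(toy, threshold=0.9)
print(pairs_found)
```

The similarity matrix grows quadratically with the number of texts, so this form suits collections that fit comfortably in memory.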

### 7.2 Text Classification with Embeddings

In [None]:
from sklearn.linear_model import LogisticRegression

# Sample dataset
texts_and_labels = [
    ("I love this product! It's the best!", "positive"),
    ("Amazing experience, will buy again!", "positive"),
    ("Great quality and fast shipping.", "positive"),
    ("Terrible product, waste of money.", "negative"),
    ("Worst purchase ever, very disappointed.", "negative"),
    ("Poor quality, arrived broken.", "negative"),
    ("It's okay, nothing special.", "neutral"),
    ("Average product, works as expected.", "neutral"),
]

texts = [t for t, _ in texts_and_labels]
labels = [l for _, l in texts_and_labels]

# Generate embeddings
X = text_embedder.encode(texts)

# Train classifier
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, labels)

# Test on new examples
test_texts = [
    "This is absolutely fantastic!",
    "Complete garbage, don't buy.",
    "It does the job, I guess."
]

test_embeddings = text_embedder.encode(test_texts)
predictions = classifier.predict(test_embeddings)

print("\nSentiment Classification Results:")
print("=" * 60)

emoji = {"positive": "😊", "negative": "😞", "neutral": "😐"}
for text, pred in zip(test_texts, predictions):
    print(f"{emoji.get(pred, '?')} [{pred:8}] {text}")

---

## 8. Best Practices

### 8.1 Choosing the Right Model

| Use Case | Recommended Model | Why |
|----------|------------------|-----|
| **General Search** | all-MiniLM-L6-v2 | Fast, good quality |
| **High Accuracy** | all-mpnet-base-v2 | Best quality, slower |
| **Multilingual** | paraphrase-multilingual-* | Cross-language support |
| **Q&A Systems** | multi-qa-* | Optimized for questions |
| **Production** | OpenAI text-embedding-3-* | Excellent quality, API |

### 8.2 Performance Optimization Tips

```python
# ✅ DO: Batch your embeddings
embeddings = embedder.encode(list_of_texts)  # Single call

# ❌ DON'T: Encode one at a time
for text in list_of_texts:
    embedding = embedder.encode(text)  # Slow!

# ✅ DO: Use caching for repeated queries
cache = EmbeddingCache(max_size=10000)
embedder = TextEmbedder(cache=cache)

# ✅ DO: Use GPU for large batches
embedder = TextEmbedder(use_gpu=True)

# ✅ DO: Normalize embeddings for cosine similarity
config = EmbeddingConfig(normalize=True)
embedder = TextEmbedder(config=config)
```

### 8.3 Common Pitfalls

| Pitfall | Solution |
|---------|----------|
| Long texts truncated | Chunk documents into smaller pieces |
| Slow inference | Use batching, caching, GPU |
| High memory usage | Process in batches, clear cache |
| Domain mismatch | Fine-tune or use domain-specific model |
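
For the first pitfall, one simple way to chunk is a sliding window over words with some overlap, so that a sentence cut at a boundary still appears whole in a neighboring chunk. The sketch below assumes whitespace-separated words as the unit; real models truncate by tokens, so treat `chunk_size` as a rough proxy. The function name `chunk_words` and its defaults are illustrative.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk can then be embedded separately, and search results mapped back to the source document.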

---

## 📝 Summary

In this notebook, we covered:

1. **Embeddings Fundamentals** - Dense vector representations of data
2. **Text Embeddings** - Using sentence-transformers for text
3. **Image Embeddings** - Using CLIP for visual content
4. **Multi-Modal Embeddings** - Joint text-image embedding space
5. **Caching** - Improving performance with EmbeddingCache
6. **Similarity Search** - Building search engines with embeddings
7. **Applications** - Duplicate detection, classification
8. **Best Practices** - Model selection, optimization

### 🎯 Key Takeaways

- Embeddings capture semantic meaning in dense vectors
- Cosine similarity measures how related two embeddings are
- Caching avoids recomputing embeddings for repeated inputs, cutting latency and API costs
- CLIP enables cross-modal (text-image) similarity
- Choose models based on your specific use case

### 📚 Next Steps

1. Explore the [Retrieval Module](../week7-retrieval/) for advanced search
2. Learn about [Reranking](../week8-reranking/) to improve results
3. Build a complete [RAG Pipeline](../week9-orchestration/) with embeddings