# Module 12: Embeddings

**Goal:** Understand embeddings, compute similarity, and build a simple semantic search system.

**Prerequisites:** Basic Python, Module 10 (Feature Engineering concepts)

**Expected Runtime:** ~25 minutes

**Outputs:**
- Text embeddings visualization
- Similarity search results
- Nearest neighbor exploration

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: What Are Embeddings?

Embeddings convert complex objects (text, users, items) into dense numerical vectors where similarity in the vector space reflects similarity in meaning.

In [None]:
# Sample support tickets
tickets = [
    "I can't reset my password",
    "How do I change my login credentials",
    "Account access problem",
    "Password reset not working",
    "Where is my order?",
    "Shipping is taking too long",
    "Track my package delivery",
    "Order hasn't arrived yet",
    "I want a refund",
    "Cancel my subscription please",
    "How do I return this product",
    "Get my money back",
]

# For this demo, TF-IDF vectors stand in for embeddings (they are sparse and match on shared words only)
# In production, use sentence-transformers for dense embeddings that capture meaning
vectorizer = TfidfVectorizer(stop_words='english')
embeddings = vectorizer.fit_transform(tickets).toarray()

print(f"Number of texts: {len(tickets)}")
print(f"Embedding dimensions: {embeddings.shape[1]}")
print(f"\nSample embedding (first 10 dims):")
print(embeddings[0][:10].round(3))

## Part 2: Similarity Metrics

### Cosine Similarity
Measures the cosine of the angle between two vectors, ignoring their lengths (the most common choice for text).

### Euclidean Distance
Measures straight-line distance (good when magnitude matters).
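
Both metrics are short formulas, and writing them out with NumPy may help show how they differ. This is a sketch on hypothetical toy vectors (`a`, `b`, `c` are made up for illustration), not the scikit-learn implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between a and b: (a . b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    """Straight-line distance between a and b: |a - b|."""
    return float(np.linalg.norm(a - b))

# Hypothetical toy vectors
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

print(f"cos(a, b) = {cosine_sim(a, b):.3f}, dist(a, b) = {euclidean_dist(a, b):.3f}")
print(f"cos(a, c) = {cosine_sim(a, c):.3f}, dist(a, c) = {euclidean_dist(a, c):.3f}")
```

Because `b` points in the same direction as `a`, their cosine similarity is 1 even though their Euclidean distance is not 0. Cosine ignores magnitude, while Euclidean distance does not.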

In [None]:
# Compute cosine similarity matrix
sim_matrix = cosine_similarity(embeddings)

# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(sim_matrix, cmap='RdYlGn', vmin=0, vmax=1)

ax.set_xticks(range(len(tickets)))
ax.set_yticks(range(len(tickets)))
ax.set_xticklabels([t[:15] + '...' for t in tickets], rotation=45, ha='right', fontsize=8)
ax.set_yticklabels([t[:15] + '...' for t in tickets], fontsize=8)

plt.colorbar(im, label='Cosine Similarity')
plt.title('Ticket Similarity Matrix')
plt.tight_layout()
plt.show()

print("💡 Notice: Tickets about the same topic (password, shipping, refunds) have higher similarity.")

In [None]:
# Compare cosine vs euclidean
def compare_metrics(idx1, idx2):
    cos_sim = cosine_similarity(embeddings[idx1:idx1+1], embeddings[idx2:idx2+1])[0][0]
    euc_dist = euclidean_distances(embeddings[idx1:idx1+1], embeddings[idx2:idx2+1])[0][0]
    
    print(f"Text 1: '{tickets[idx1]}'")
    print(f"Text 2: '{tickets[idx2]}'")
    print(f"Cosine Similarity: {cos_sim:.3f}")
    print(f"Euclidean Distance: {euc_dist:.3f}")
    print()

print("=== Similar Texts ===")
compare_metrics(0, 1)  # Both about password/login

print("=== Different Texts ===")
compare_metrics(0, 8)  # Password vs refund
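
One detail worth checking when comparing the two metrics on TF-IDF vectors: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and for unit-length vectors the two metrics are tied together by `distance² = 2 × (1 − cosine similarity)`. The sketch below verifies this on a small subset of the tickets; `docs` and `X` are hypothetical names used only here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = [
    "I can't reset my password",
    "Password reset not working",
    "Where is my order?",
    "I want a refund",
]
X = TfidfVectorizer(stop_words='english').fit_transform(docs).toarray()

# With the default norm='l2', every non-empty row has length 1
row_norms = np.linalg.norm(X, axis=1)
print("Row norms:", row_norms.round(3))

cos = cosine_similarity(X)
euc = euclidean_distances(X)

# For unit vectors: distance^2 = 2 * (1 - cosine similarity)
print("Identity holds:", np.allclose(euc ** 2, 2 * (1 - cos), atol=1e-6))
```

So on these embeddings the two metrics rank pairs identically. The choice matters more for vectors that are not normalized.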

## Part 3: Nearest Neighbor Search

Find the most similar items to a query.

In [None]:
# Build nearest neighbors index
nn = NearestNeighbors(n_neighbors=3, metric='cosine')
nn.fit(embeddings)

def search(query, k=3):
    """Find k most similar texts to a query."""
    query_embedding = vectorizer.transform([query]).toarray()
    distances, indices = nn.kneighbors(query_embedding, n_neighbors=k)
    
    print(f"Query: '{query}'\n")
    print("Top matches:")
    for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
        sim = 1 - dist  # Convert distance to similarity
        print(f"  {i+1}. [{sim:.3f}] {tickets[idx]}")
    print()

# Test searches
search("login issue")
search("where is my package")
search("I want my money back")
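
For a corpus this small, the same search can be done without an index by scoring the query against every ticket and sorting. The sketch below shows that brute-force approach; `corpus`, `tfidf`, and `brute_force_search` are hypothetical names, and the five-ticket corpus is a subset chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I can't reset my password",
    "Password reset not working",
    "Where is my order?",
    "Track my package delivery",
    "I want a refund",
]
tfidf = TfidfVectorizer(stop_words='english')
corpus_vectors = tfidf.fit_transform(corpus)

def brute_force_search(query, k=2):
    """Return (similarity, text) for the k corpus entries most similar to query."""
    query_vector = tfidf.transform([query])
    sims = cosine_similarity(query_vector, corpus_vectors)[0]
    top_idx = np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity
    return [(float(sims[i]), corpus[i]) for i in top_idx]

for sim, text in brute_force_search("password reset"):
    print(f"[{sim:.3f}] {text}")
```

A query whose words all fall outside the fitted vocabulary becomes an all-zero vector and scores 0 against every ticket, which is worth keeping in mind when interpreting low-similarity results.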

## Part 4: Embedding Visualization

Project high-dimensional embeddings to 2D for visualization.

In [None]:
# Use PCA for the 2D projection
# (t-SNE is a common alternative on larger datasets; for small ones like this, PCA is fine)
pca = PCA(n_components=2)
coords_2d = pca.fit_transform(embeddings)

# Define topic colors
topics = ['password'] * 4 + ['shipping'] * 4 + ['refund'] * 4
topic_colors = {'password': '#ef4444', 'shipping': '#22c55e', 'refund': '#0ea5e9'}
colors = [topic_colors[t] for t in topics]

# Plot
fig, ax = plt.subplots(figsize=(10, 8))

for topic in topic_colors:
    mask = [t == topic for t in topics]
    ax.scatter(coords_2d[mask, 0], coords_2d[mask, 1], 
               c=topic_colors[topic], label=topic.capitalize(), s=100, alpha=0.7)

# Add labels
for i, txt in enumerate(tickets):
    ax.annotate(f"{i+1}", (coords_2d[i, 0] + 0.02, coords_2d[i, 1] + 0.02), fontsize=8)

ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
ax.set_title('Support Tickets in Embedding Space')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 Similar tickets cluster together even without explicit labels!")
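
`TSNE` is imported in Setup, so here is a hedged sketch of the equivalent t-SNE projection. The `perplexity=4` value is an assumption picked only because t-SNE requires perplexity to be smaller than the number of samples (12 here); `texts`, `vectors`, and `tsne_coords` are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

texts = [
    "I can't reset my password",
    "How do I change my login credentials",
    "Account access problem",
    "Password reset not working",
    "Where is my order?",
    "Shipping is taking too long",
    "Track my package delivery",
    "Order hasn't arrived yet",
    "I want a refund",
    "Cancel my subscription please",
    "How do I return this product",
    "Get my money back",
]
vectors = TfidfVectorizer(stop_words='english').fit_transform(texts).toarray()

# perplexity must be smaller than the number of samples (12 here)
tsne = TSNE(n_components=2, perplexity=4, init='random', random_state=42)
tsne_coords = tsne.fit_transform(vectors)

print(tsne_coords.shape)  # (12, 2)
```

The resulting coordinates can be plotted in place of `coords_2d`. With so few points, the layout is sensitive to `perplexity` and the random seed, so distances in a t-SNE plot should be read with caution.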

## Part 5: Using Pre-trained Embeddings (Recommended)

In production, use sentence-transformers: pre-trained models can place two texts close together even when they share no words, which TF-IDF cannot do.

In [None]:
# Install if needed: pip install sentence-transformers
# Uncomment to use real embeddings:

# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')
# embeddings = model.encode(tickets)
# print(f"Embedding shape: {embeddings.shape}")

# This cell does not compute anything; it only prints the recipe to use in production
print("=== Pre-trained Semantic Embeddings ===")
print("In production, use:")
print("  from sentence_transformers import SentenceTransformer")
print("  model = SentenceTransformer('all-MiniLM-L6-v2')")
print("  embeddings = model.encode(texts)")

## Part 6: TODO - Build a Simple Duplicate Detector

In [None]:
# TODO: Find potential duplicate tickets
# Tickets with similarity above `threshold` might be duplicates

threshold = 0.7  # Adjust this (e.g. try 0.8)

print(f"=== Potential Duplicates (similarity > {threshold}) ===")
duplicates_found = []

for i in range(len(tickets)):
    for j in range(i + 1, len(tickets)):
        sim = sim_matrix[i, j]
        if sim > threshold:
            duplicates_found.append((i, j, sim))
            print(f"\n[{sim:.3f}] Pair {i+1} & {j+1}:")
            print(f"  → '{tickets[i]}'")
            print(f"  → '{tickets[j]}'")

print(f"\nFound {len(duplicates_found)} potential duplicate pairs")

# TODO: Try different thresholds - what happens?
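
As one possible way to approach the TODO, the nested loop can be replaced by a vectorized lookup over the upper triangle of the similarity matrix. This is a sketch, not the expected solution; `find_duplicate_pairs` and `toy_sim` are hypothetical, and the matrix values are made up for illustration.

```python
import numpy as np

def find_duplicate_pairs(similarity, threshold):
    """Return (i, j, sim) for every pair i < j with similarity above threshold."""
    n = similarity.shape[0]
    rows, cols = np.triu_indices(n, k=1)  # upper triangle, diagonal excluded
    sims = similarity[rows, cols]
    keep = sims > threshold
    pairs = zip(rows[keep], cols[keep], sims[keep])
    # Most similar pairs first
    return sorted(((int(i), int(j), float(s)) for i, j, s in pairs),
                  key=lambda p: -p[2])

# Hypothetical 4x4 similarity matrix for illustration
toy_sim = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.75],
    [0.0, 0.0, 0.75, 1.0],
])

for i, j, s in find_duplicate_pairs(toy_sim, threshold=0.7):
    print(f"[{s:.3f}] Pair {i+1} & {j+1}")
```

Raising the threshold to 0.8 drops the second pair, which is the effect the TODO asks you to explore on the real `sim_matrix`.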

## Part 7: TODO - Threshold Selection

Different thresholds trade off precision vs recall.

In [None]:
# Analyze how threshold affects results
thresholds = np.arange(0.3, 1.0, 0.1)
counts = []

for thresh in thresholds:
    # Count each pair once: the strict upper triangle excludes the diagonal
    upper = sim_matrix[np.triu_indices(len(tickets), k=1)]
    counts.append(int(np.sum(upper > thresh)))

plt.figure(figsize=(10, 5))
plt.bar(thresholds, counts, width=0.08, color='#6366f1', alpha=0.7)
plt.xlabel('Similarity Threshold')
plt.ylabel('Number of Pairs')
plt.title('Pairs Above Threshold vs Threshold Value')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

print("💡 Lower threshold = more results (higher recall, lower precision)")
print("   Higher threshold = fewer results (lower recall, higher precision)")
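
Counting pairs shows volume but not quality. With ground-truth labels, precision and recall can be measured directly at each threshold. The sketch below assumes that two tickets with the same topic label count as a true match, which is a simplification for illustration; `precision_recall`, `toy_sim`, and `toy_labels` are hypothetical, and the matrix values are made up.

```python
import numpy as np

def precision_recall(similarity, labels, threshold):
    """Precision and recall of 'similarity > threshold' as a same-label predictor."""
    labels = np.asarray(labels)
    rows, cols = np.triu_indices(len(labels), k=1)  # each pair once
    predicted = similarity[rows, cols] > threshold
    actual = labels[rows] == labels[cols]
    true_pos = np.sum(predicted & actual)
    precision = true_pos / predicted.sum() if predicted.sum() else float('nan')
    recall = true_pos / actual.sum() if actual.sum() else float('nan')
    return float(precision), float(recall)

# Hypothetical similarity matrix and topic labels for four tickets
toy_sim = np.array([
    [1.0, 0.8, 0.4, 0.1],
    [0.8, 1.0, 0.2, 0.1],
    [0.4, 0.2, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
])
toy_labels = ['password', 'password', 'shipping', 'shipping']

for thresh in (0.25, 0.5):
    p, r = precision_recall(toy_sim, toy_labels, thresh)
    print(f"threshold={thresh}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, the lower threshold finds both true pairs but also flags one false pair (precision 0.67, recall 1.00), while the higher threshold flags only one pair, correctly (precision 1.00, recall 0.50).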

## Part 8: TODO - Stakeholder Summary

Explain to a product manager:
1. What embeddings do and why they're useful
2. How you'd use them to improve the support system
3. What trade-offs exist in threshold selection

### Your Summary:

*Write your explanation here...*

---

## Key Takeaways

1. **Embeddings** turn text/objects into numbers that preserve meaning
2. **Cosine similarity** is the standard metric for text similarity
3. **Nearest neighbors** enable fast similarity search
4. **Threshold selection** trades off precision vs recall
5. **Pre-trained models** (sentence-transformers) capture meaning beyond shared words, which TF-IDF cannot

### Next Steps
- Explore the interactive playground
- Complete the quiz
- Try sentence-transformers for production-quality embeddings