# Module 05 - Notebook 01: Embeddings Basics

## Learning Objectives
- Understand what embeddings are and why they matter
- Learn the difference between keyword matching and semantic similarity
- Explore embedding dimensions and properties
- Visualize embeddings in reduced dimensions

---

## 1. What Are Embeddings?

**Embeddings** are numerical representations of text (or other data) as vectors in a high-dimensional space.

### Key Properties:
- **Semantic similarity**: Similar meanings → Similar vectors
- **Fixed dimensions**: Every embedding from a given model has the same number of dimensions (e.g., 384, 768, or 1536), regardless of the input's length
- **Learned representations**: Created by neural networks trained on massive datasets
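
To make the first property concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The values are invented for illustration and do not come from any model, and `toy_cosine` is a hypothetical helper name. It computes cosine similarity, the measure used later in this notebook: the dot product of two vectors divided by the product of their lengths.

```python
import numpy as np

# Toy 3-D "embeddings" (made-up values, NOT produced by any model)
toy_vectors = {
    "cat":    np.array([0.9, 0.8, 0.1]),
    "kitten": np.array([0.8, 0.9, 0.2]),
    "car":    np.array([0.1, 0.2, 0.9]),
}

def toy_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat_kitten = toy_cosine(toy_vectors["cat"], toy_vectors["kitten"])
cat_car = toy_cosine(toy_vectors["cat"], toy_vectors["car"])

print(f"cat vs kitten: {cat_kitten:.3f}")
print(f"cat vs car:    {cat_car:.3f}")
```

With these toy values, "cat" and "kitten" point in nearly the same direction and score close to 1, while "cat" and "car" score much lower.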

### Why Embeddings Matter:
- Enable semantic search (meaning-based, not just keyword)
- Power recommendation systems
- Enable clustering and classification
- Foundation for RAG (Retrieval-Augmented Generation)

## 2. Setup

In [None]:
!pip install -q sentence-transformers numpy scikit-learn matplotlib python-dotenv

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load a small, fast model for demonstrations
model = SentenceTransformer('all-MiniLM-L6-v2')
print("✓ Model loaded")
print(f"  Embedding dimension: {model.get_sentence_embedding_dimension()}")

## 3. Creating Your First Embeddings

In [None]:
# Simple example
text = "Machine learning is amazing!"
embedding = model.encode(text)

print(f"Text: {text}")
print(f"Embedding shape: {embedding.shape}")
print(f"First 10 values: {embedding[:10]}")
print(f"\nThis text is now represented as a {len(embedding)}-dimensional vector!")

## 4. Semantic Similarity

In [None]:
# Create embeddings for multiple sentences
sentences = [
    "I love machine learning",
    "I enjoy studying AI",
    "The weather is nice today",
    "It's a beautiful sunny day",
    "Neural networks are fascinating"
]

embeddings = model.encode(sentences)
print(f"Created {len(embeddings)} embeddings")
print(f"Shape: {embeddings.shape}")

In [None]:
# Calculate similarity matrix
similarity_matrix = cosine_similarity(embeddings)

# Display similarity between each pair
print("\nSimilarity Matrix:")
print("(1.0 = identical direction, values near 0.0 = unrelated)\n")

for i, sent1 in enumerate(sentences):
    for j, sent2 in enumerate(sentences):
        if i < j:  # Only show upper triangle
            sim = similarity_matrix[i][j]
            print(f"[{sim:.3f}] '{sent1}' <-> '{sent2}'")
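
As a rough sketch of what `cosine_similarity` computes: scaling each row to unit length and multiplying the matrix by its own transpose yields every pairwise cosine at once. The block below uses random stand-in vectors (an assumption, so it runs without downloading a model), and `pairwise_cosine` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def pairwise_cosine(vectors: np.ndarray) -> np.ndarray:
    """Return the matrix of cosine similarities between all rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms   # each row now has length 1
    return unit @ unit.T     # dot products of unit vectors are cosines

# Random stand-ins for 5 sentence embeddings with 384 dimensions
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 384))

sim = pairwise_cosine(fake_embeddings)
print(sim.shape)                       # (5, 5)
print(np.allclose(np.diag(sim), 1.0))  # each vector has similarity 1 with itself
print(np.allclose(sim, sim.T))         # similarity is symmetric
```

In the notebook, `pairwise_cosine(embeddings)` should agree with `similarity_matrix` up to floating-point error. Note that cosine similarity can range from -1 to 1, although sentence embeddings of ordinary text rarely produce strongly negative values.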

## 5. Keyword vs Semantic Search

Let's compare traditional keyword matching with semantic similarity.

In [None]:
# Knowledge base
documents = [
    "Python is a programming language",
    "The snake slithered through the grass",
    "JavaScript is used for web development",
    "Coding in Python is enjoyable",
    "A python is a type of reptile"
]

query = "I want to learn programming"

# Create embeddings
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

# Calculate similarities
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

# Rank by similarity
ranked_indices = np.argsort(similarities)[::-1]

print(f"Query: '{query}'\n")
print("Semantic Search Results (by relevance):\n")
for rank, idx in enumerate(ranked_indices, 1):
    print(f"{rank}. [{similarities[idx]:.3f}] {documents[idx]}")

print("\nNotice: Documents about programming should rank higher,")
print("even when they share few or no words with the query!")
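
For contrast, a naive keyword matcher can be sketched as counting the words shared between the query and each document. This is a deliberately simple baseline written for illustration: `tokenize` and `keyword_score` are hypothetical helpers, and real keyword search engines typically use weighting schemes such as TF-IDF or BM25. The block repeats `documents` and `query` from the cell above so that it runs on its own.

```python
import re

documents = [
    "Python is a programming language",
    "The snake slithered through the grass",
    "JavaScript is used for web development",
    "Coding in Python is enjoyable",
    "A python is a type of reptile"
]

query = "I want to learn programming"

def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_score(query: str, document: str) -> int:
    """Count how many distinct query words also appear in the document."""
    return len(tokenize(query) & tokenize(document))

keyword_scores = [keyword_score(query, doc) for doc in documents]

print(f"Query: '{query}'\n")
print("Keyword Search Results (by shared words):\n")
ranked = sorted(zip(keyword_scores, documents), key=lambda pair: pair[0], reverse=True)
for score, doc in ranked:
    print(f"[{score}] {doc}")
```

Only the first document shares a word ("programming") with the query. The other two programming-related documents score zero, which makes them indistinguishable from the snake and reptile sentences. Compare this with where those documents land in the semantic ranking above.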

## 6. Visualizing Embeddings

Embeddings live in high-dimensional space (384 dimensions for this model). Let's visualize them in 2D.

In [None]:
# Create embeddings for themed sentences
tech_sentences = [
    "Machine learning algorithms",
    "Neural networks",
    "Artificial intelligence",
    "Deep learning models"
]

food_sentences = [
    "Delicious pizza",
    "Tasty pasta",
    "Fresh salad",
    "Gourmet burger"
]

sports_sentences = [
    "Playing basketball",
    "Soccer match",
    "Tennis tournament",
    "Swimming competition"
]

all_sentences = tech_sentences + food_sentences + sports_sentences
all_embeddings = model.encode(all_sentences)

# Reduce to 2D using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(all_embeddings)

# Plot
plt.figure(figsize=(12, 8))

# Plot each category with different colors
colors = ['blue', 'green', 'red']
labels = ['Tech', 'Food', 'Sports']
sizes = [len(tech_sentences), len(food_sentences), len(sports_sentences)]

start_idx = 0
for size, color, label in zip(sizes, colors, labels):
    end_idx = start_idx + size
    plt.scatter(
        embeddings_2d[start_idx:end_idx, 0],
        embeddings_2d[start_idx:end_idx, 1],
        c=color,
        label=label,
        s=100,
        alpha=0.6
    )
    start_idx = end_idx

# Add labels to points
for i, sent in enumerate(all_sentences):
    plt.annotate(
        sent[:20],
        (embeddings_2d[i, 0], embeddings_2d[i, 1]),
        fontsize=8,
        alpha=0.7
    )

plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Embedding Visualization (384D → 2D)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nNotice: Similar topics cluster together in embedding space!")
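
A 2D projection discards most of the original dimensions, so it helps to check how much variance the two components retain. In the notebook, the fitted `pca` object exposes this as `pca.explained_variance_ratio_`. The sketch below computes the same quantity with NumPy's SVD on random stand-in data (an assumption, so it runs without the model); `explained_variance_ratio` is a hypothetical helper name.

```python
import numpy as np

def explained_variance_ratio(data: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Fraction of total variance captured by each of the top principal components."""
    centered = data - data.mean(axis=0)  # PCA operates on mean-centered data
    singular_values = np.linalg.svd(centered, compute_uv=False)  # sorted descending
    variances = singular_values ** 2     # proportional to per-component variance
    return variances[:n_components] / variances.sum()

# Random stand-ins for 12 sentence embeddings with 384 dimensions
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(12, 384))

ratios = explained_variance_ratio(fake_embeddings)
print(f"Variance kept by each component: {ratios}")
print(f"Total variance kept in 2D: {ratios.sum():.1%}")
```

A low total means the 2D plot is only a coarse summary: points that look close on screen may be farther apart in the full embedding space.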

## 7. Embedding Properties

In [None]:
# Test various properties
test_cases = [
    ("cat", "dog", "kitten"),
    ("king", "queen", "prince"),
    ("car", "automobile", "vehicle"),
]

print("Embedding Relationships:\n")
for word1, word2, word3 in test_cases:
    emb1, emb2, emb3 = model.encode([word1, word2, word3])
    
    sim_12 = cosine_similarity([emb1], [emb2])[0][0]
    sim_13 = cosine_similarity([emb1], [emb3])[0][0]
    sim_23 = cosine_similarity([emb2], [emb3])[0][0]
    
    print(f"{word1} - {word2}: {sim_12:.3f}")
    print(f"{word1} - {word3}: {sim_13:.3f}")
    print(f"{word2} - {word3}: {sim_23:.3f}")
    print()

## 8. Exercise: Find Similar Sentences

Complete the following exercise to practice working with embeddings.

In [None]:
# TODO: Complete this exercise

def find_most_similar(query: str, candidates: list, top_k: int = 3):
    """
    Find the most similar sentences to a query.
    
    Args:
        query: Search query
        candidates: List of candidate sentences
        top_k: Number of results to return
    
    Returns:
        List of (sentence, similarity_score) tuples
    """
    # Your code here:
    # 1. Create embeddings for query and candidates
    # 2. Calculate similarities
    # 3. Sort and return top_k results
    pass

# Test your function
knowledge_base = [
    "The Earth orbits around the Sun",
    "Python is a versatile programming language",
    "The Mona Lisa was painted by Leonardo da Vinci",
    "Machine learning is a subset of AI",
    "The capital of France is Paris",
    "JavaScript runs in web browsers",
    "Mount Everest is the tallest mountain",
    "Neural networks mimic brain structure"
]

test_query = "Tell me about artificial intelligence"

# results = find_most_similar(test_query, knowledge_base)
# for sent, score in results:
#     print(f"[{score:.3f}] {sent}")

## Summary

In this notebook, you learned:
- ✅ What embeddings are and why they matter
- ✅ How to create embeddings with Sentence Transformers
- ✅ The difference between keyword and semantic search
- ✅ How to calculate semantic similarity
- ✅ How to visualize embeddings

## Key Takeaways

1. **Embeddings capture meaning**, not just words
2. **Similar meanings = similar vectors** in embedding space
3. **Cosine similarity** measures how closely two embeddings point in the same direction, which serves as a proxy for how related they are
4. **Semantic search** finds relevant content by meaning, not keywords

## Next Steps
- 📘 Proceed to Notebook 02: OpenAI Embeddings
- 🔗 Learn about [Sentence Transformers](https://www.sbert.net/)
- 📚 Read about [Word2Vec and embeddings history](https://en.wikipedia.org/wiki/Word2vec)