# 🎨 Day 4: Embeddings Discovery Laboratory

## Educational Mission: Transforming Discrete Tokens into Semantic Intelligence

Welcome to our embeddings discovery laboratory! Today we bridge the tokenization foundations from Day 3 into the realm of semantic vector representations. You'll discover how AI systems transform discrete tokens into continuous vector spaces where meaning becomes mathematically measurable.

### 🎯 Learning Objectives

By completing this interactive laboratory, you will:

1. **Generate and Analyze Embeddings**: Create vector representations and understand their mathematical properties
2. **Explore Semantic Similarity**: Discover how vector mathematics captures meaning relationships
3. **Master Vector Arithmetic**: Understand how embeddings encode relational patterns
4. **Visualize Semantic Clustering**: See how related concepts cluster in high-dimensional space
5. **Build Semantic Search Systems**: Create practical applications using embedding intelligence

### 🔄 Connection to Day 3: From Tokens to Vectors

Yesterday, you discovered how tokenizers segment text into discrete units. Today, we transform those tokens into **continuous semantic vectors** that capture meaning:

```
Day 3: "The AI model" → ["The", "AI", "model"] (discrete tokens)
Day 4: ["The", "AI", "model"] → [[0.2, -0.1, 0.8...], [0.9, 0.3, -0.2...], [-0.1, 0.7, 0.4...]] (semantic vectors)
```

### 🧬 What Are Embeddings?

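**Embeddings** are dense vector representations that capture semantic meaning in continuous space. The values in each dimension are learned during training; a single dimension is rarely interpretable on its own, but together the dimensions place each text at a position where distances and directions reflect meaning.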

Think of embeddings as **coordinates in meaning space** - similar concepts cluster together, while different concepts remain distant.
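To make the token-to-vector step concrete, here is a minimal sketch of an embedding lookup. The vocabulary, the 4-dimensional size, and the random values are all assumptions for illustration; a trained model learns these values and uses far more dimensions.

```python
import numpy as np

# Hypothetical toy vocabulary; real models have tens of thousands of entries.
vocab = {"The": 0, "AI": 1, "model": 2}

# Toy embedding matrix: one row per token, 4 dimensions each.
# In a trained model these values are learned; here they are random.
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.normal(size=(len(vocab), 4))

tokens = ["The", "AI", "model"]
token_ids = [vocab[t] for t in tokens]

# Looking up embeddings is plain row indexing into the matrix.
vectors = embedding_matrix[token_ids]

print(vectors.shape)  # (3, 4)
```

Each token ends up as one row of numbers, which is the form every later experiment works with.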

### ⚡ Quick Setup

Run this cell to import our discovery tools and set up the learning environment:

In [None]:
# 🔧 Discovery Laboratory Setup
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

# Add src directory to path for our discovery tools
sys.path.append('../src/day4')

# Import our custom embedding discovery lab
try:
    from embeddings_discovery_lab import EmbeddingDiscoveryLab
    print("✅ Embeddings Discovery Lab loaded successfully!")
except ImportError as e:
    print(f"⚠️ Discovery Lab import issue: {e}")
    print("The experiments below need EmbeddingDiscoveryLab; check that ../src/day4 is on the path.")

# Configure display settings for educational clarity
plt.style.use('default')
np.set_printoptions(precision=3, suppress=True)

print("🎨 Embeddings Discovery Laboratory Ready!")
print("Let's explore how AI transforms tokens into semantic intelligence...")

## 🧪 Discovery Experiment 1: Basic Embedding Generation

### Educational Mission: Understanding Vector Properties

Let's start by generating embeddings for simple text and examining their mathematical properties. This reveals how meaning gets encoded into numerical vectors.

### 🔍 What to Observe:
- **Dimensionality**: How many features represent each concept?
- **Value Distribution**: Are values positive, negative, or mixed?
- **Magnitude**: How "long" is each vector?
- **Sparsity**: How many dimensions are zero vs. non-zero?
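The properties listed above can be computed directly with NumPy. The sketch below is an assumption about how such statistics might be derived; it is not the lab's own `analyze_embedding_properties` implementation, which is not shown here. The helper name `describe_vector` and the `zero_dimensions` key are hypothetical.

```python
import numpy as np

def describe_vector(vec):
    """Summarize basic numerical properties of one embedding vector."""
    vec = np.asarray(vec, dtype=float)
    return {
        "embedding_dimensions": vec.size,
        "vector_magnitude": float(np.linalg.norm(vec)),  # Euclidean length
        "mean_value": float(vec.mean()),
        "std_deviation": float(vec.std()),
        "positive_dimensions": int((vec > 0).sum()),
        "negative_dimensions": int((vec < 0).sum()),
        "zero_dimensions": int((vec == 0).sum()),        # sparsity check
    }

# Tiny hand-made vector so the numbers are easy to verify by eye
stats = describe_vector([0.5, -0.5, 0.0, 0.5, -0.5])
for name, value in stats.items():
    print(f"{name}: {value}")
```

With real dense embeddings you should expect the zero count to be at or near 0, which is what "dense" means in practice.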

In [None]:
# 🎯 Initialize Discovery Lab
lab = EmbeddingDiscoveryLab()

# Educational text samples for analysis
educational_texts = [
    "artificial intelligence",
    "machine learning algorithms",
    "natural language processing",
    "deep neural networks",
    "computer vision systems"
]

print("🧪 DISCOVERY EXPERIMENT 1: EMBEDDING PROPERTY ANALYSIS")
print("=" * 55)

# Analyze each text sample
for i, text in enumerate(educational_texts, 1):
    print(f"\n📝 Sample {i}: '{text}'")

    # Generate embedding and analyze properties
    analysis = lab.analyze_embedding_properties(text)

    if analysis:
        print(f"   🔢 Token Count: {analysis['token_count']}")
        print(f"   📊 Vector Dimensions: {analysis['embedding_dimensions']}")
        print(f"   📏 Vector Magnitude: {analysis['vector_magnitude']:.3f}")
        print(f"   ⚖️ Mean Value: {analysis['mean_value']:.3f}")
        print(f"   📈 Standard Deviation: {analysis['std_deviation']:.3f}")
        print(f"   ➕ Positive Dimensions: {analysis['positive_dimensions']}")
        print(f"   ➖ Negative Dimensions: {analysis['negative_dimensions']}")
        # Show first 5 tokens
        print(f"   🔍 Tokens: {analysis['tokens'][:5]}...")

        # Educational insights: compare the sign split of the components
        pos = analysis['positive_dimensions']
        neg = analysis['negative_dimensions']
        if pos > neg:
            print("   💡 Insight: More positive than negative components")
        elif neg > pos:
            print("   💡 Insight: More negative than positive components")
        else:
            print("   💡 Insight: Positive and negative components are evenly split")

print("\n🎓 Key Discovery:")
print("Each concept is represented by a high-dimensional vector where:")
print("- Meaning is spread across many learned dimensions, not stored in any single one")
print("- Direction matters most; many models return vectors of roughly unit magnitude")
print("- Positive and negative values together define where the vector points")

## 🔍 Discovery Experiment 2: Semantic Similarity Exploration

### Educational Mission: Measuring Meaning Relationships

Now we explore how embeddings capture semantic relationships through mathematical similarity. This is where the magic happens - similar concepts have similar vectors!

### 🔍 What to Observe:
- **Cosine Similarity**: How "aligned" are two meaning vectors?
- **Semantic Patterns**: Do synonyms score higher than unrelated words?
- **Relationship Gradients**: How do different types of relationships score?
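Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores magnitude. A minimal from-scratch sketch (the helper name `cosine` is ours; the notebook itself uses scikit-learn's `cosine_similarity`):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 0], [2, 0]))   # 1.0  (same direction, different length)
print(cosine([1, 0], [0, 3]))   # 0.0  (perpendicular)
print(cosine([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

The score ranges from -1 to 1, which is why the experiment's thresholds of 0.7 and 0.4 are judgment calls rather than fixed rules.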

In [None]:
print("🔍 DISCOVERY EXPERIMENT 2: SEMANTIC SIMILARITY EXPLORATION")
print("=" * 58)

# Educational text pairs for similarity analysis
similarity_test_pairs = [
    # Synonyms (should be highly similar)
    ("happy", "joyful"),
    ("intelligent", "smart"),
    ("quickly", "rapidly"),

    # Related concepts (moderately similar)
    ("computer", "laptop"),
    ("doctor", "hospital"),
    ("book", "reading"),

    # Different domains (low similarity)
    ("mathematics", "cooking"),
    ("ocean", "software"),
    ("music", "chemistry"),

    # Opposites (interesting case!)
    ("hot", "cold"),
    ("big", "small"),
    ("light", "dark")
]

# Analyze semantic relationships
similarity_results = lab.explore_semantic_similarity(similarity_test_pairs)

# Educational analysis of results
print(f"\n📊 SIMILARITY PATTERN ANALYSIS")
print("-" * 35)

high_sim_pairs = []
medium_sim_pairs = []
low_sim_pairs = []

for pair_key, result in similarity_results.items():
    sim_score = result['cosine_similarity']
    pair_desc = f"{result['text1']} ↔ {result['text2']}"

    if sim_score > 0.7:
        high_sim_pairs.append((pair_desc, sim_score))
    elif sim_score > 0.4:
        medium_sim_pairs.append((pair_desc, sim_score))
    else:
        low_sim_pairs.append((pair_desc, sim_score))

print(f"\n🎯 High Similarity (>0.7): Strong semantic relationships")
for pair, score in high_sim_pairs:
    print(f"   {pair}: {score:.3f}")

print(f"\n🤔 Medium Similarity (0.4-0.7): Related concepts")
for pair, score in medium_sim_pairs:
    print(f"   {pair}: {score:.3f}")

print(f"\n🔄 Low Similarity (≤0.4): Different semantic domains")
for pair, score in low_sim_pairs:
    print(f"   {pair}: {score:.3f}")

print(f"\n🎓 Key Discovery:")
print(f"Embeddings capture semantic relationships through vector similarity!")
print(f"- Similar meanings → similar vectors → high cosine similarity")
print(f"- Different meanings → different vectors → low cosine similarity")
print(f"- Even opposites can have medium similarity (they're related concepts!)")

## 🧮 Discovery Experiment 3: Vector Arithmetic Magic

### Educational Mission: Mathematics of Meaning

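This is where embeddings reveal their true power! We can perform arithmetic on the vectors themselves. The famous example: `King - Man + Woman ≈ Queen`, meaning the result of the arithmetic lands closer to the vector for "queen" than to most other words, not that it equals it exactly.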

### 🔍 What to Observe:
- **Relational Patterns**: How do embeddings capture analogies?
- **Vector Arithmetic**: Can we mathematically manipulate meaning?
- **Semantic Algebra**: What happens when we add/subtract concepts?
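The analogy arithmetic can be sketched on hand-built toy vectors. Everything here is an assumption for illustration: the three axes (royalty, gender, fruit), the five words, and the helper `solve_analogy`. Real embeddings do not have labeled axes, and the lab's `discover_vector_arithmetic` may work differently.

```python
import numpy as np

# Hand-built toy vectors; axes are (royalty, gender, fruit).
toy = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c, vectors):
    """A is to B as C is to ?  ->  word nearest to B - A + C."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the three input words, as is conventional for analogy tests
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

answer = solve_analogy("king", "queen", "man", toy)
print(answer)  # woman
```

The answer is found by nearest-neighbor search around the computed point, which is why results on real embeddings are approximate.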

In [None]:
print("🧮 DISCOVERY EXPERIMENT 3: VECTOR ARITHMETIC EXPLORATION")
print("=" * 55)

# Educational analogies for vector arithmetic discovery
analogy_experiments = [
    # Classic gender relationships
    ("king", "queen", "man"),        # man → woman
    ("boy", "girl", "father"),       # father → mother

    # Geographic relationships
    ("Paris", "Rome", "France"),     # France → Italy
    ("Tokyo", "Berlin", "Japan"),    # Japan → Germany

    # Activity relationships
    ("swimming", "skiing", "water"),  # water → snow
    ("reading", "watching", "book"),  # book → movie

    # Professional relationships
    ("teacher", "doctor", "school"),  # school → hospital
    ("chef", "programmer", "kitchen")  # kitchen → computer
]

print("Exploring analogies: A is to B as C is to ?")
print("Vector arithmetic: B - A + C = ?")

# Discover vector arithmetic patterns
arithmetic_results = lab.discover_vector_arithmetic(analogy_experiments)

# Educational analysis of vector arithmetic
print(f"\n📊 VECTOR ARITHMETIC PATTERN ANALYSIS")
print("-" * 42)

successful_analogies = []
partial_analogies = []

for analogy_key, result in arithmetic_results.items():
    similarity = result['similarity_score']
    relationship = result['relationship']

    if similarity > 0.6:
        successful_analogies.append((relationship, similarity))
    else:
        partial_analogies.append((relationship, similarity))

print(f"\n🎯 Strong Analogies (>0.6): Clear relational patterns")
for relationship, score in successful_analogies:
    print(f"   {relationship} (similarity: {score:.3f})")

print(f"\n🤔 Partial Analogies (≤0.6): Weaker but still meaningful")
for relationship, score in partial_analogies:
    print(f"   {relationship} (similarity: {score:.3f})")

print(f"\n🎓 Key Discovery:")
print(f"Vector arithmetic reveals the algebraic structure of meaning!")
print(f"- Embeddings encode relational patterns as vector differences")
print("- We can manipulate concepts mathematically: King - Man + Woman ≈ Queen")
print(f"- Semantic relationships become geometric transformations")

## 🎨 Discovery Experiment 4: Semantic Clustering Visualization

### Educational Mission: Visualizing Meaning Space

Let's visualize how embeddings organize concepts in semantic space. Related concepts should cluster together, while different domains remain separate.

### 🔍 What to Observe:
- **Semantic Clusters**: Do related concepts group together?
- **Boundary Formation**: Where do different semantic domains separate?
- **Dimensional Reduction**: How well does projecting from high-dimensional space (e.g., 1536D) down to 2D preserve relationships?
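To see what the reduction step does, here is a minimal PCA sketch using NumPy's SVD on synthetic data. The two 50-dimensional "topics" are random stand-ins for real embeddings, and the notebook's own visualization may rely on scikit-learn's `PCA` instead.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two synthetic "topics": 5 points each, scattered around different centers in 50D
center_a = rng.normal(size=50)
center_b = rng.normal(size=50)
points = np.vstack([
    center_a + 0.1 * rng.normal(size=(5, 50)),
    center_b + 0.1 * rng.normal(size=(5, 50)),
])

# PCA: center the data, then take the top right-singular vectors as axes
centered = points - points.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

coords_2d = centered @ Vt[:2].T          # project onto the first two components
explained = (S ** 2) / np.sum(S ** 2)    # share of variance per component

print(coords_2d.shape)
print(f"variance kept by 2 components: {explained[:2].sum():.1%}")
```

When the kept variance is low, distances in the 2D plot can be misleading, so it is worth reading that number before interpreting any picture.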

In [None]:
print("🎨 DISCOVERY EXPERIMENT 4: SEMANTIC CLUSTERING VISUALIZATION")
print("=" * 60)

# Educational word groups for clustering analysis
semantic_word_groups = {
    "AI_Technology": [
        "artificial intelligence",
        "machine learning",
        "neural networks",
        "deep learning",
        "computer vision"
    ],
    "Programming": [
        "python",
        "javascript",
        "programming",
        "software",
        "algorithm"
    ],
    "Science": [
        "physics",
        "chemistry",
        "biology",
        "mathematics",
        "research"
    ],
    "Nature": [
        "forest",
        "ocean",
        "mountain",
        "wildlife",
        "ecosystem"
    ],
    "Arts": [
        "painting",
        "music",
        "literature",
        "sculpture",
        "creativity"
    ]
}

print("Analyzing semantic clustering patterns...")
print(f"Word groups: {list(semantic_word_groups.keys())}")
print(
    f"Total concepts: {sum(len(words) for words in semantic_word_groups.values())}")

# Visualize semantic clustering
clustering_results = lab.visualize_semantic_clusters(semantic_word_groups)

# Educational analysis of clustering
if clustering_results:
    print(f"\n📊 CLUSTERING ANALYSIS RESULTS")
    print("-" * 35)

    total_variance = sum(clustering_results['pca_explained_variance'])
    print(f"\n🎯 Dimensionality Reduction Quality:")
    print(f"   Total variance preserved: {total_variance:.1%}")
    print(
        f"   PC1 explains: {clustering_results['pca_explained_variance'][0]:.1%}")
    print(
        f"   PC2 explains: {clustering_results['pca_explained_variance'][1]:.1%}")

    print(f"\n🔍 Cluster Assignments:")
    cluster_groups = {}
    for word, cluster in clustering_results['cluster_assignments'].items():
        if cluster not in cluster_groups:
            cluster_groups[cluster] = []
        cluster_groups[cluster].append(word)

    for cluster_id, words in cluster_groups.items():
        print(f"   Cluster {cluster_id}: {', '.join(words[:3])}...")

print(f"\n🎓 Key Discovery:")
print(f"Embeddings naturally organize concepts into semantic neighborhoods!")
print(f"- Related concepts cluster together in high-dimensional space")
print(f"- Different semantic domains form distinct regions")
print("- The 2D view keeps only part of the structure: the preserved variance shows how much")

## 🔍 Discovery Experiment 5: Contextual Embedding Analysis

### Educational Mission: Understanding Context Dependence

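One fascinating property of modern embeddings is their **context sensitivity**: the same word can contribute to very different embeddings depending on its context. In this experiment we embed each full sentence, so the scores compare sentence-level meaning rather than the target word in isolation.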

### 🔍 What to Observe:
- **Context Effects**: How does surrounding text change word meaning?
- **Polysemy**: How are multiple word meanings handled?
- **Semantic Disambiguation**: Can embeddings distinguish word senses?
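Comparing several context embeddings pair by pair amounts to building a similarity matrix. The sketch below shows one way to compute all pairs at once; the random 8-dimensional vectors are stand-ins for real sentence embeddings, so the printed scores carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
context_vectors = rng.normal(size=(3, 8))  # stand-ins for 3 sentence embeddings

# Normalize each row to unit length; one matrix product then gives every cosine
unit = context_vectors / np.linalg.norm(context_vectors, axis=1, keepdims=True)
similarity_matrix = unit @ unit.T

# The upper triangle (i < j) lists each unordered pair exactly once
for i, j in zip(*np.triu_indices(len(unit), k=1)):
    print(f"Context {i+1} ↔ Context {j+1}: {similarity_matrix[i, j]:.3f}")
```

The diagonal is always 1 (each vector compared with itself), so only the off-diagonal entries are informative.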

In [None]:
print("🔍 DISCOVERY EXPERIMENT 5: CONTEXTUAL EMBEDDING ANALYSIS")
print("=" * 57)

# Educational examples of context-dependent meaning
contextual_examples = [
    # "Bank" in different contexts
    {
        "word": "bank",
        "contexts": [
            "I deposited money at the bank",
            "We sat by the river bank",
            "The airplane made a sharp bank to the left"
        ]
    },
    # "Apple" in different contexts
    {
        "word": "apple",
        "contexts": [
            "I ate a red apple for lunch",
            "Apple released a new iPhone",
            "The apple tree in our garden"
        ]
    },
    # "Light" in different contexts
    {
        "word": "light",
        "contexts": [
            "Turn on the light switch",
            "The box was surprisingly light",
            "Light travels at incredible speed"
        ]
    }
]

print("Analyzing how context influences embedding meaning...\n")

for example in contextual_examples:
    word = example["word"]
    contexts = example["contexts"]

    print(f"📝 Analyzing '{word}' in different contexts:")

    # Get embeddings for each context
    context_embeddings = []
    for i, context in enumerate(contexts, 1):
        embedding = lab.get_embedding(context)
        if embedding is not None:
            context_embeddings.append(embedding)
            print(f"   Context {i}: '{context}'")

    # Analyze similarities between different contexts
    if len(context_embeddings) >= 2:
        print(f"\n   🔍 Contextual Similarity Analysis:")

        for i in range(len(context_embeddings)):
            for j in range(i + 1, len(context_embeddings)):
                similarity = cosine_similarity([context_embeddings[i]], [
                                               context_embeddings[j]])[0][0]
                print(f"      Context {i+1} ↔ Context {j+1}: {similarity:.3f}")

                # Educational insights
                if similarity > 0.8:
                    print(f"         💡 High similarity - similar meaning/usage")
                elif similarity > 0.6:
                    print(f"         🤔 Moderate similarity - related but distinct")
                else:
                    print(f"         🔄 Low similarity - different semantic domains")

    print("\n" + "-" * 50 + "\n")

print(f"🎓 Key Discovery:")
print("Modern embeddings are context-aware and capture meaning nuances!")
print("- Same word + different context → different embeddings")
print("- Context provides disambiguation for polysemous words")
print("- Sentence embeddings reflect both the shared word and how it is used")

## 🎯 Discovery Experiment 6: Advanced Similarity Search

### Educational Mission: Building Semantic Search Systems

Now let's build a practical application! We'll create a semantic search system that finds relevant documents based on meaning rather than keyword matching.

### 🔍 What to Observe:
- **Semantic vs. Keyword Search**: How do results differ?
- **Relevance Ranking**: How are documents ordered by semantic similarity?
- **Query Understanding**: How does the system interpret search intent?
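At its core, semantic search is "score every document against the query, then sort". Here is a compact sketch of that ranking step, assuming embeddings are already available as NumPy arrays. The helper `top_k` and the 3-dimensional toy vectors are hypothetical.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Return (index, score) pairs for the k rows most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                           # cosine similarity per document
    order = np.argsort(scores)[::-1][:k]     # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Toy "document embeddings": rows 0 and 1 point in nearly the same direction
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.8, 0.0, 0.2])

results = top_k(query, docs, k=2)
print([idx for idx, _ in results])  # [0, 1]
```

Normalizing once and using a single matrix product scales better than calling a similarity function per document, which matters as the collection grows.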

In [None]:
print("🎯 DISCOVERY EXPERIMENT 6: ADVANCED SIMILARITY SEARCH")
print("=" * 54)

# Educational document collection for semantic search
document_collection = [
    "Machine learning algorithms can identify patterns in large datasets automatically.",
    "Deep neural networks consist of multiple layers that process information hierarchically.",
    "Natural language processing enables computers to understand and generate human language.",
    "Computer vision systems can recognize objects and scenes in digital images.",
    "Reinforcement learning agents learn optimal behavior through trial and error.",
    "Transformers revolutionized language modeling with attention mechanisms.",
    "Convolutional networks excel at processing grid-like data such as images.",
    "Recurrent networks can model sequential data with temporal dependencies.",
    "Generative models create new data samples that resemble training data.",
    "Supervised learning requires labeled examples to train predictive models.",
    "Unsupervised learning discovers hidden structures in unlabeled data.",
    "Transfer learning adapts pre-trained models to new domains efficiently."
]

print(
    f"Building semantic search system with {len(document_collection)} documents...\n")

# Generate embeddings for all documents
print("📊 Generating document embeddings...")
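document_embeddings = []
embedded_documents = []
for i, doc in enumerate(document_collection):
    embedding = lab.get_embedding(doc)
    if embedding is not None:
        document_embeddings.append(embedding)
        embedded_documents.append(doc)
        print(f"   Document {i+1}: Embedded ({len(embedding)} dimensions)")

# Keep texts index-aligned with embeddings in case any document failed to embed
document_collection = embedded_documents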

# Test semantic search with different queries
search_queries = [
    "How do neural networks learn from data?",
    "What techniques work best for image recognition?",
    "How can AI understand human speech?",
    "What approaches don't require labeled training data?"
]

print(f"\n🔍 SEMANTIC SEARCH EXPERIMENTS")
print("=" * 35)

for query_num, query in enumerate(search_queries, 1):
    print(f"\n🎯 Query {query_num}: '{query}'")

    # Generate query embedding
    query_embedding = lab.get_embedding(query)
    if query_embedding is None:
        continue

    # Calculate similarities to all documents
    similarities = []
    for doc_embedding in document_embeddings:
        similarity = cosine_similarity(
            [query_embedding], [doc_embedding])[0][0]
        similarities.append(similarity)

    # Rank documents by similarity
    ranked_docs = sorted(enumerate(similarities),
                         key=lambda x: x[1], reverse=True)

    print(f"\n   📋 Top 3 Most Relevant Documents:")
    for rank, (doc_idx, similarity) in enumerate(ranked_docs[:3], 1):
        doc_text = document_collection[doc_idx][:60] + "..."
        print(f"      {rank}. [{similarity:.3f}] {doc_text}")

        # Educational insights about relevance
        if similarity > 0.7:
            print(f"         💡 Highly relevant - strong semantic match")
        elif similarity > 0.5:
            print(f"         🤔 Moderately relevant - partial semantic overlap")
        else:
            print(f"         📊 Weakly relevant - limited semantic connection")

print(f"\n🎓 Key Discovery:")
print(f"Semantic search understands meaning beyond keywords!")
print(f"- Queries match documents by conceptual similarity")
print(f"- Results ranked by semantic relevance, not keyword frequency")
print(f"- System understands intent even with different vocabulary")

## 🎓 Day 4 Discovery Summary

### 🎯 What You've Discovered

Congratulations! You've completed a comprehensive exploration of embeddings and their remarkable properties. Here's what you've mastered:

#### 🧬 **Embedding Fundamentals**
- **Vector Representation**: How discrete tokens become continuous semantic vectors
- **Mathematical Properties**: Dimensionality, magnitude, and distribution patterns
- **Semantic Encoding**: How meaning gets captured in numerical form

#### 🔍 **Similarity and Relationships**
- **Cosine Similarity**: Mathematical measurement of semantic closeness
- **Relationship Patterns**: How synonyms, related concepts, and opposites compare
- **Context Sensitivity**: How surrounding text influences meaning representation

#### 🧮 **Vector Arithmetic Magic**
- **Semantic Algebra**: Performing arithmetic operations on meaning itself
- **Analogy Discovery**: Using vector math to find relational patterns
- **Mathematical Meaning**: Understanding embeddings as algebraic structures

#### 🎨 **Visualization and Clustering**
- **Semantic Space**: How concepts organize in high-dimensional space
- **Clustering Patterns**: Natural grouping of related concepts
- **Dimensionality Reduction**: Preserving relationships in lower dimensions

#### 🎯 **Practical Applications**
- **Semantic Search**: Building meaning-based information retrieval
- **Relevance Ranking**: Ordering results by conceptual similarity
- **Real-world Systems**: Understanding how modern AI applications work

### 🔄 Connection to Tomorrow (Day 5)

Today's embedding discoveries set the foundation for tomorrow's attention mechanisms exploration:

```
Day 4: "king" → [0.2, -0.1, 0.8, ...] (semantic vector)
Day 5: How does "king" attend to "crown", "royal", "kingdom"? (attention patterns)
```

**Tomorrow's Preview**: Attention mechanisms build on these vectors: they score pairs of token representations (via dot products of learned projections) to determine which words should "pay attention" to each other when processing sequences!
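As a small taste of that idea, the sketch below turns dot-product scores into attention-style weights with a softmax. It is heavily simplified: real attention applies learned query, key, and value projections, whereas here the raw toy vectors (invented for illustration) play both the query and key roles.

```python
import numpy as np

def softmax(x):
    """Convert scores into positive weights that sum to 1."""
    exp = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return exp / exp.sum()

# Hypothetical 2D vectors for three words
words = ["king", "crown", "banana"]
vectors = np.array([
    [1.0, 0.5],
    [0.9, 0.6],
    [-0.5, 1.0],
])

query = vectors[0]                                        # "king" asks the question
scores = vectors @ query / np.sqrt(vectors.shape[1])      # scaled dot products
weights = softmax(scores)

for word, weight in zip(words, weights):
    print(f"king → {word}: {weight:.3f}")
```

Words whose vectors point in a similar direction to the query receive larger weights, which is the intuition Day 5 builds on.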

### 🚀 Keep Exploring

Try these follow-up experiments:

1. **Custom Vocabulary**: Test embeddings with domain-specific terms from your field
2. **Language Comparison**: Compare embeddings for the same concept in different languages
3. **Temporal Analysis**: Explore how word meanings change over time
4. **Multimodal Embeddings**: Investigate image-text embedding combinations

### 💡 Key Insights to Remember

- **Embeddings Transform Discrete → Continuous**: Converting tokens into measurable meaning
- **Mathematics Captures Semantics**: Vector operations reveal semantic relationships
- **Context Matters**: Modern embeddings adapt meaning based on surrounding text
- **Practical Power**: Semantic search demonstrates real-world embedding applications

You've now mastered the foundational technology that powers modern AI language understanding! 🎉