# Exercise 3: Word Embeddings

Welcome to the fascinating world of word embeddings! You'll learn how to represent words as vectors and discover semantic relationships between them.

## Learning Objectives
By the end of this exercise, you will be able to:
1. **Vector Representation**: Convert words into numerical vectors that capture semantic meaning
2. **Similarity Analysis**: Calculate and interpret word similarities using cosine similarity
3. **Word2Vec Training**: Train your own word embeddings on German text
4. **Embedding Visualization**: Create 2D visualizations of high-dimensional word vectors
5. **Analogy Tasks**: Solve word analogies using vector arithmetic (king - man + woman ≈ queen)
6. **German Language Processing**: Handle German-specific embeddings and compound words

## What You'll Build
- German word similarity analyzer
- Custom Word2Vec model trained on German text
- Interactive word embedding visualizations
- Word analogy solver
- Semantic clustering system

## Applications
- **Search Systems**: Find semantically similar documents
- **Recommendation Engines**: Suggest related products based on descriptions
- **Translation**: Bridge languages through shared embedding spaces
- **Content Analysis**: Group similar concepts automatically

**Ready to unlock the hidden meanings in words?** 🔤✨

## Exercise 1: Exploring Pre-trained Word Embeddings

**Goal**: Explore semantic relationships using pre-trained German word embeddings.

**Your Tasks**: 
1. Load and explore pre-trained German embeddings
2. Calculate word similarities and find nearest neighbors
3. Visualize word relationships in 2D space
4. Solve word analogies using vector arithmetic

**Hints**:
- Use spaCy's German models for pre-trained embeddings
- Cosine similarity measures the angle between word vectors
- Similar words cluster together in embedding space
- Vector arithmetic can reveal semantic relationships
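
To make the cosine similarity hint concrete, here is a minimal sketch on invented three-dimensional vectors. The `toy_vectors` values and the `cosine_sim` helper are assumptions made up for illustration (real spaCy vectors have hundreds of dimensions), and the sketch uses only the standard library so it runs without the notebook's dependencies.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", invented for illustration only
toy_vectors = {
    "hund": [0.9, 0.8, 0.1],
    "katze": [0.8, 0.9, 0.2],
    "auto": [0.1, 0.2, 0.9],
}

sim_hund_katze = cosine_sim(toy_vectors["hund"], toy_vectors["katze"])
sim_hund_auto = cosine_sim(toy_vectors["hund"], toy_vectors["auto"])
print(f"hund/katze: {sim_hund_katze:.3f}")
print(f"hund/auto:  {sim_hund_auto:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score closer to 0, which is why the two animal words come out more similar to each other than to `auto` in this toy setup.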

### Setup and Imports

In [None]:
# Essential imports for word embeddings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans  # needed for the clustering step
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import warnings
warnings.filterwarnings('ignore')

# Import for word embeddings
try:
    from gensim.models import Word2Vec, FastText, KeyedVectors  # FastText is used in the comparison step
    from gensim.utils import simple_preprocess
    print("✅ Gensim imported successfully!")
except ImportError:
    print("❌ Please install gensim: pip install gensim")

# Try to load German spaCy model with word vectors
try:
    import spacy
    nlp = spacy.load("de_core_news_md")  # Medium model with word vectors
    print("✅ German spaCy model (medium) loaded successfully!")
    print(f"   Model has {len(nlp.vocab)} vocabulary entries")
    # .size is 0 for an empty vector table; the shape tuple itself is always truthy
    print(f"   Vector dimensions: {nlp.vocab.vectors.shape[1] if nlp.vocab.vectors.size else 'No vectors'}")
except ImportError:
    print("❌ Please install spaCy: pip install spacy")
    nlp = None
except IOError:
    try:
        nlp = spacy.load("de_core_news_sm")  # Small model fallback
        print("⚠️  German spaCy model (small) loaded. Limited word vectors available.")
    except IOError:
        print("❌ Please install German spaCy model: python -m spacy download de_core_news_md")
        nlp = None

print("\n🤖 Word Embedding Toolkit Ready!")
print("Available tools: Pre-trained embeddings, Word2Vec training, Similarity analysis")

### Step 1: Basic Word Similarity Analysis

In [None]:
def explore_word_similarities(nlp_model, word, top_n=10):
    """
    Find the most similar words to a given word using pre-trained embeddings.
    
    Args:
        nlp_model: Loaded spaCy model
        word (str): Target word to find similarities for
        top_n (int): Number of similar words to return
    
    Returns:
        list: List of (word, similarity_score) tuples
    """
    # TODO: Implement word similarity analysis:
    # 1. Get the word vector for the target word
    # 2. Calculate similarities with other words in vocabulary
    # 3. Return top N most similar words with scores
    
    if nlp_model is None:
        print("No language model loaded!")
        return []
    
    # Get the target word's vector
    target_doc = nlp_model(word)
    if not target_doc[0].has_vector:
        print(f"No vector available for word: {word}")
        return []
    
    target_vector = target_doc[0].vector
    
    # Find similar words by comparing with vocabulary
    similarities = []
    
    # Sample from vocabulary (full vocab is very large)
    vocab_sample = list(nlp_model.vocab)[:10000]  # Sample first 10k words
    
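    for token in vocab_sample:
        # Skip the query word itself, which would otherwise always rank first
        if token.text == word:
            continue
        if token.has_vector and token.is_alpha and not token.is_stop:
            # Token.similarity accepts a Lexeme, so the pipeline is not re-run per word
            similarity = target_doc[0].similarity(token)
            if similarity > 0.3:  # Filter out very dissimilar words
                similarities.append((token.text, similarity))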
    
    # Sort by similarity and return top N
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]

# Test word similarity analysis
if nlp:
    test_words = ["Hund", "Auto", "schön", "Berlin", "essen"]
    
    print("🔍 Word Similarity Analysis")
    print("=" * 50)
    
    for word in test_words:
        print(f"\nWords similar to '{word}':")
        similar_words = explore_word_similarities(nlp, word, top_n=5)
        
        if similar_words:
            for sim_word, score in similar_words:
                print(f"  {sim_word}: {score:.3f}")
        else:
            print("  No similar words found or word not in vocabulary")
else:
    print("Please load a spaCy model with word vectors to run this analysis!")

### Step 2: Training Custom German Word2Vec Model

In [None]:
def train_german_word2vec(texts, vector_size=100, window=5, min_count=2):
    """
    Train a custom Word2Vec model on German text.
    
    Args:
        texts (list): List of German texts
        vector_size (int): Dimensionality of word vectors
        window (int): Context window size
        min_count (int): Minimum word frequency to include
    
    Returns:
        Word2Vec: Trained model
    """
    # TODO: Implement Word2Vec training:
    # 1. Preprocess texts (tokenization, cleaning)
    # 2. Create Word2Vec model with appropriate parameters
    # 3. Train the model on your corpus
    # 4. Save the model for later use
    
    if 'Word2Vec' not in globals():
        print("Please install gensim: pip install gensim")
        return None
    
    # Preprocess texts for training
    processed_texts = []
    for text in texts:
        # Simple tokenization and cleaning (tokens are lowercased).
        # deacc=False keeps umlauts, so "schön" and "schon" stay distinct.
        words = simple_preprocess(text, deacc=False, min_len=2, max_len=15)
        processed_texts.append(words)
    
    print(f"Training Word2Vec on {len(processed_texts)} documents...")
    print(f"Parameters: vector_size={vector_size}, window={window}, min_count={min_count}")
    
    # Create and train Word2Vec model
    model = Word2Vec(
        sentences=processed_texts,
        vector_size=vector_size,
        window=window,
        min_count=min_count,
        workers=4,
        sg=1,  # Skip-gram model
        epochs=10
    )
    
    print(f"Model trained! Vocabulary size: {len(model.wv.key_to_index)}")
    return model

# Create sample German texts for training
sample_german_texts = [
    "Berlin ist die Hauptstadt von Deutschland und eine wunderschöne Stadt.",
    "München ist bekannt für das Oktoberfest und liegt in Bayern.",
    "Hamburg hat einen großen Hafen und ist eine wichtige Hafenstadt.",
    "Köln ist eine alte Stadt mit einem berühmten Dom.",
    "Frankfurt ist das Finanzzentrum Deutschlands mit vielen Banken.",
    "Stuttgart ist die Heimat von Mercedes-Benz und Porsche.",
    "Dresden ist eine kulturell reiche Stadt in Sachsen.",
    "Leipzig ist eine Universitätsstadt mit langer Geschichte.",
    "Nürnberg ist bekannt für Lebkuchen und Christkindlmärkte.",
    "Bremen ist eine Hansestadt im Norden Deutschlands."
]

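# Train the Word2Vec model
# min_count=1 because each city name appears only once in this tiny corpus
custom_model = train_german_word2vec(sample_german_texts, vector_size=50, window=3, min_count=1)

if custom_model:
    print("\n🎯 Testing Custom Word2Vec Model:")
    test_word = "berlin"  # simple_preprocess lowercases all tokens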
    try:
        similar_words = custom_model.wv.most_similar(test_word, topn=3)
        print(f"Words similar to '{test_word}':")
        for word, score in similar_words:
            print(f"  {word}: {score:.3f}")
    except KeyError:
        print(f"Word '{test_word}' not in vocabulary. Try: {list(custom_model.wv.key_to_index.keys())[:10]}")
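
To make the `window` parameter more concrete, the following sketch enumerates the (target, context) pairs a skip-gram model would see for one tokenized sentence. The `skipgram_pairs` helper is a hypothetical name introduced here, and the sketch is a simplification: gensim's actual training does more (negative sampling, for example), so this only illustrates what "context window" means.

```python
def skipgram_pairs(tokens, window=2):
    """List (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["berlin", "ist", "die", "hauptstadt", "von", "deutschland"]

narrow_pairs = skipgram_pairs(sentence, window=1)
wide_pairs = skipgram_pairs(sentence, window=3)

print(f"window=1 gives {len(narrow_pairs)} pairs, e.g. {narrow_pairs[:3]}")
print(f"window=3 gives {len(wide_pairs)} pairs")
```

A larger window produces more pairs per sentence and links words that sit further apart, which tends to emphasise topical relatedness over close syntactic neighbours.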

### Step 3: Word Analogy Tasks

def solve_word_analogies(model, analogies):
    """
    Solve word analogies using vector arithmetic.
    
    Args:
        model: Trained word embedding model (Word2Vec or spaCy)
        analogies (list): List of (word1, word2, word3) tuples for "word1 is to word2 as word3 is to ?"
    
    Returns:
        list: Solutions to analogies
    """
    # TODO: Implement word analogy solving:
    # 1. Use vector arithmetic: word2 - word1 + word3 = answer
    # 2. Find the word closest to the result vector
    # 3. Handle cases where words are not in vocabulary
    
    print("🧠 Solving Word Analogies...")
    print("=" * 40)
    
    results = []
    for word1, word2, word3 in analogies:
        try:
            if hasattr(model, 'wv'):  # Word2Vec model
                result = model.wv.most_similar(positive=[word2, word3], negative=[word1], topn=1)
                answer = result[0][0]
                confidence = result[0][1]
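            else:  # spaCy model
                if model.vocab.vectors.size == 0:
                    raise KeyError("model has no word vectors")
                vec1 = model(word1)[0].vector
                vec2 = model(word2)[0].vector
                vec3 = model(word3)[0].vector
                result_vec = vec2 - vec1 + vec3
                # Look up the closest stored vector (this may be one of the input words)
                keys, _, scores = model.vocab.vectors.most_similar(
                    np.asarray([result_vec]), n=1
                )
                answer = model.vocab.strings[keys[0][0]]
                confidence = float(scores[0][0])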
            
            print(f"{word1} : {word2} :: {word3} : {answer} (confidence: {confidence:.3f})")
            results.append((word1, word2, word3, answer, confidence))
            
        except (KeyError, IndexError) as e:
            print(f"{word1} : {word2} :: {word3} : [word not in vocabulary]")
            results.append((word1, word2, word3, None, 0.0))
    
    return results

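# Test analogies with our custom model
# Note: simple_preprocess lowercases tokens and the sample corpus is tiny,
# so the custom model will report most of these words as out of vocabulary.
german_analogies = [
    ("Mann", "König", "Frau"),      # man:king :: woman:?
    ("Berlin", "Deutschland", "Paris"),  # Berlin:Germany :: Paris:?
    ("groß", "größer", "klein"),    # big:bigger :: small:?
]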

if custom_model:
    analogy_results = solve_word_analogies(custom_model, german_analogies)
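
The vector arithmetic behind an analogy ("a is to b as c is to ?") can be seen on a toy vocabulary. Everything in this sketch is invented for illustration: the three axes, the vectors, and the `solve_analogy` helper are assumptions, not values from a trained model. It uses only the standard library.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_sim(vectors[w], target))

# Hypothetical axes: (male, female, royal); the values are made up
toy = {
    "mann": [1.0, 0.0, 0.0],
    "frau": [0.0, 1.0, 0.0],
    "könig": [1.0, 0.0, 1.0],
    "königin": [0.0, 1.0, 1.0],
    "auto": [0.3, 0.3, 0.0],
}

answer = solve_analogy(toy, "mann", "könig", "frau")
print(f"mann : könig :: frau : {answer}")
```

Excluding the three input words from the candidates matters in practice, because the result vector often lies closest to one of its own inputs.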

## Exercise Tasks

Complete the following tasks to deepen your understanding:

1. **Advanced Similarity Analysis**:
   - Compare different similarity metrics (cosine, Euclidean, Manhattan)
   - Analyze how similarity scores change with different vector dimensions
   - Create word similarity heatmaps for related concepts

2. **Embedding Visualization**:
   - Use t-SNE or PCA to visualize word embeddings in 2D
   - Create interactive plots with plotly
   - Identify semantic clusters in the visualization

3. **German-Specific Challenges**:
   - Handle German compound words (Komposita)
   - Analyze how umlauts affect similarity scores
   - Compare performance on formal vs. informal German text

4. **Model Comparison**:
   - Compare Word2Vec, FastText, and transformer embeddings
   - Evaluate on word similarity benchmarks
   - Analyze computational efficiency trade-offs

5. **Application Development**:
   - Build a semantic search engine for German documents
   - Create a word analogy game interface
   - Implement document similarity using averaged word vectors
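
As a possible starting point for the first task, the sketch below compares the three metrics on a vector and a scaled copy of it. The helper names are hypothetical and the vectors are invented. It assumes Python 3.8+ (for `math.dist` and multi-argument `math.hypot`) and non-zero vectors.

```python
import math

def cosine_score(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def euclidean_distance(u, v):
    """Straight-line distance between the two points."""
    return math.dist(u, v)

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

base = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]  # same direction, twice the length

print(f"cosine:    {cosine_score(base, scaled):.3f}")
print(f"euclidean: {euclidean_distance(base, scaled):.3f}")
print(f"manhattan: {manhattan_distance(base, scaled):.3f}")
```

Cosine similarity treats the two vectors as identical because it ignores length, whereas both distance measures report them as clearly apart. That difference is worth keeping in mind when comparing embeddings whose norms vary with word frequency.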

## Reflection Questions

1. How do different training parameters affect embedding quality?
2. What are the advantages and disadvantages of different embedding approaches?
3. How can you evaluate embedding quality without labeled data?
4. What challenges are specific to German language embeddings?
5. How do embeddings capture semantic vs. syntactic relationships?

## Next Steps

- Explore contextual embeddings (BERT, ELMo)
- Learn about cross-lingual embeddings for translation
- Study specialized domain embeddings (medical, legal, technical)
- Investigate bias in word embeddings and mitigation strategies

In [None]:
def explore_spacy_embeddings(nlp_model, words):
    """
    Explore word embeddings using spaCy.
    
    Args:
        nlp_model: Loaded spaCy model
        words (list): List of German words to analyze
    
    Returns:
        dict: Word embeddings and similarities
    """
    if nlp_model is None:
        print("spaCy model not available")
        return None
    
    # TODO: Implement the following analysis:
    # 1. Get word vectors for each word
    # 2. Calculate pairwise similarities
    # 3. Find most similar words
    # 4. Explore word analogies
    
    results = {
        'embeddings': {},
        'similarities': {},
        'most_similar': {}
    }
    
    print("Analyzing spaCy word embeddings:")
    print("=" * 40)
    
    # Get embeddings for each word
    for word in words:
        token = nlp_model(word)[0]
        if token.has_vector:
            results['embeddings'][word] = token.vector
            print(f"Word: {word}")
            print(f"  Vector shape: {token.vector.shape}")
            print(f"  Vector norm: {np.linalg.norm(token.vector):.3f}")
            
            # Find most similar words in vocabulary
            # Note: simplified approach. Tokens have no most_similar method, so we compare against a small hand-picked list
            similarities = []
            sample_words = ["Auto", "Haus", "Katze", "Hund", "Buch", "Computer", "Wasser", "Liebe"]
            
            for other_word in sample_words:
                if other_word != word:
                    other_token = nlp_model(other_word)[0]
                    if other_token.has_vector:
                        similarity = token.similarity(other_token)
                        similarities.append((other_word, similarity))
            
            # Sort by similarity
            similarities.sort(key=lambda x: x[1], reverse=True)
            results['most_similar'][word] = similarities[:3]
            
            print(f"  Most similar words:")
            for sim_word, sim_score in similarities[:3]:
                print(f"    {sim_word}: {sim_score:.3f}")
        else:
            print(f"Word '{word}' has no vector representation")
        print()
    
    return results

# Test words for analysis
test_words = ["König", "Königin", "Mann", "Frau", "Berlin", "Deutschland", "Auto", "fahren"]

# Explore spaCy embeddings
spacy_results = explore_spacy_embeddings(nlp, test_words)

### Step 2: Creating Training Data for Custom Embeddings

In [None]:
def create_training_corpus():
    """
    Create a sample German corpus for training word embeddings.
    In practice, you would load a much larger corpus from files.
    
    Returns:
        list: List of tokenized sentences
    """
    # TODO: Create a diverse German text corpus:
    # 1. Include various topics and domains
    # 2. Ensure sufficient word frequency for meaningful embeddings
    # 3. Preprocess and tokenize the text
    
    german_texts = [
        # Technology and computers
        "Computer sind heute sehr wichtig für die Arbeit und das Leben.",
        "Das Internet verbindet Menschen auf der ganzen Welt miteinander.",
        "Künstliche Intelligenz wird immer wichtiger in der Technologie.",
        "Smartphones und Tablets sind mobile Computer geworden.",
        "Software und Hardware müssen gut zusammenarbeiten.",
        
        # Transportation
        "Autos fahren auf Straßen und Autobahnen durch die Stadt.",
        "Der Zug fährt schnell vom Bahnhof zum nächsten Bahnhof.",
        "Flugzeuge fliegen hoch über den Wolken zum Zielort.",
        "Fahrräder sind umweltfreundliche Verkehrsmittel in der Stadt.",
        "Busse transportieren viele Passagiere durch die Stadt.",
        
        # Family and relationships
        "Die Familie ist sehr wichtig für das Glück der Menschen.",
        "Eltern lieben ihre Kinder und sorgen für sie.",
        "Freunde helfen sich gegenseitig in schwierigen Zeiten.",
        "Großeltern erzählen ihren Enkeln interessante Geschichten.",
        "Geschwister spielen zusammen und lernen voneinander.",
        
        # Nature and animals
        "Hunde sind treue Freunde und beliebte Haustiere.",
        "Katzen sind unabhängige und elegante Tiere.",
        "Vögel fliegen frei in der Luft und singen schöne Lieder.",
        "Bäume wachsen in Wäldern und Parks der Stadt.",
        "Blumen blühen im Frühling in bunten Farben.",
        
        # Food and cooking
        "Deutsche essen gerne Brot, Wurst und Käse zum Frühstück.",
        "Kochen macht Spaß und bringt Familien zusammen.",
        "Restaurants servieren leckere Gerichte aus aller Welt.",
        "Obst und Gemüse sind gesund und wichtig für die Ernährung.",
        "Kuchen und Torte sind beliebte Desserts in Deutschland.",
        
        # Education and learning
        "Schüler lernen in der Schule viele wichtige Fächer.",
        "Lehrer unterrichten mit Begeisterung und Geduld.",
        "Bücher enthalten Wissen und spannende Geschichten.",
        "Universitäten bieten höhere Bildung und Forschung.",
        "Lernen ist ein lebenslanger Prozess für alle Menschen.",
        
        # Work and professions
        "Ärzte helfen kranken Menschen und retten Leben.",
        "Ingenieure entwickeln neue Technologien und Maschinen.",
        "Künstler schaffen schöne Werke und inspirieren andere.",
        "Handwerker bauen und reparieren wichtige Dinge.",
        "Wissenschaftler forschen und entdecken neue Erkenntnisse."
    ]
    
    # Tokenize sentences
    tokenized_corpus = []
    for text in german_texts:
        # Simple preprocessing and tokenization: lowercases and drops punctuation.
        # deacc=False keeps German umlauts intact (deacc=True would turn "schüler" into "schuler").
        tokens = simple_preprocess(text, deacc=False)
        tokenized_corpus.append(tokens)
    
    print(f"Created corpus with {len(tokenized_corpus)} sentences")
    print(f"Sample sentence: {tokenized_corpus[0]}")
    
    # Calculate vocabulary statistics
    all_words = [word for sentence in tokenized_corpus for word in sentence]
    unique_words = set(all_words)
    
    print(f"Total words: {len(all_words)}")
    print(f"Unique words: {len(unique_words)}")
    
    return tokenized_corpus

# Create training corpus
training_corpus = create_training_corpus()

### Step 3: Training Custom Word2Vec Model

In [None]:
def train_word2vec_model(corpus, vector_size=100, window=5, min_count=1, workers=4):
    """
    Train a custom Word2Vec model on the German corpus.
    
    Args:
        corpus (list): Tokenized sentences
        vector_size (int): Dimensionality of word vectors
        window (int): Context window size
        min_count (int): Minimum word frequency
        workers (int): Number of worker threads
    
    Returns:
        Word2Vec: Trained model
    """
    # TODO: Train Word2Vec model with different parameters:
    # 1. Try both CBOW and Skip-gram architectures
    # 2. Experiment with different vector sizes
    # 3. Test various window sizes
    # 4. Analyze the impact of min_count
    
    print("Training Word2Vec model...")
    
    # Train Skip-gram model (sg=1) - good for small datasets
    model = Word2Vec(
        sentences=corpus,
        vector_size=vector_size,
        window=window,
        min_count=min_count,
        workers=workers,
        sg=1,  # Skip-gram (1) vs CBOW (0)
        epochs=20,  # More epochs for better training
        seed=42
    )
    
    print(f"Model trained successfully!")
    print(f"Vocabulary size: {len(model.wv.key_to_index)}")
    print(f"Vector dimensions: {model.wv.vector_size}")
    
    return model

def analyze_word2vec_model(model, test_words):
    """
    Analyze the trained Word2Vec model.
    
    Args:
        model: Trained Word2Vec model  
        test_words (list): Words to analyze
    """
    print("\nWord2Vec Model Analysis:")
    print("=" * 40)
    
    wv = model.wv  # KeyedVectors object
    
    for word in test_words:
        if word in wv.key_to_index:
            print(f"\nWord: {word}")
            
            # Get most similar words
            try:
                similar_words = wv.most_similar(word, topn=3)
                print(f"Most similar words:")
                for sim_word, similarity in similar_words:
                    print(f"  {sim_word}: {similarity:.3f}")
            except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
                print(f"  Could not find similar words for '{word}'")
                
            # Get vector
            vector = wv[word]
            print(f"Vector shape: {vector.shape}")
            print(f"Vector norm: {np.linalg.norm(vector):.3f}")
        else:
            print(f"\nWord '{word}' not in vocabulary")

# Train Word2Vec model
w2v_model = train_word2vec_model(training_corpus)

# Analyze the model
analyze_word2vec_model(w2v_model, ["computer", "auto", "hund", "haus", "lernen"])

### Step 4: Word Similarity and Analogies

In [None]:
def explore_word_relationships(model):
    """
    Explore word relationships and analogies in the trained model.
    
    Args:
        model: Trained Word2Vec model
    """
    # TODO: Implement word relationship analysis:
    # 1. Calculate pairwise similarities
    # 2. Test word analogies (A is to B as C is to D)
    # 3. Find words that don't belong in a group
    # 4. Explore semantic relationships
    
    wv = model.wv
    
    print("Exploring Word Relationships:")
    print("=" * 40)
    
    # Test pairwise similarities
    word_pairs = [
        ("hund", "katze"),
        ("auto", "fahrrad"),
        ("computer", "technologie"),
        ("haus", "wohnung"),
        ("lernen", "schule")
    ]
    
    print("\nPairwise Similarities:")
    for word1, word2 in word_pairs:
        if word1 in wv.key_to_index and word2 in wv.key_to_index:
            similarity = wv.similarity(word1, word2)
            print(f"{word1} <-> {word2}: {similarity:.3f}")
        else:
            print(f"{word1} <-> {word2}: Words not in vocabulary")
    
    # Test analogies (if vocabulary is sufficient)
    print("\nWord Analogies:")
    analogies = [
        ("mann", "frau", "vater"),  # mann:frau :: vater:?
        ("auto", "fahren", "flugzeug"),  # auto:fahren :: flugzeug:?
        ("hund", "bellen", "katze"),  # hund:bellen :: katze:?
    ]
    
    for word1, word2, word3 in analogies:
        if all(word in wv.key_to_index for word in [word1, word2, word3]):
            try:
                # A is to B as C is to ?
                result = wv.most_similar(positive=[word2, word3], negative=[word1], topn=1)
                if result:
                    answer, score = result[0]
                    print(f"{word1}:{word2} :: {word3}:{answer} (score: {score:.3f})")
            except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
                print(f"Could not compute analogy for {word1}:{word2} :: {word3}:?")
        else:
            print(f"Analogy {word1}:{word2} :: {word3}:? - missing words in vocabulary")
    
    # Find odd-one-out
    print("\nOdd-One-Out:")
    word_groups = [
        ["hund", "katze", "auto"],  # auto should be odd
        ["computer", "internet", "baum"],  # baum should be odd
        ["essen", "trinken", "fahren"]  # fahren should be odd
    ]
    
    for group in word_groups:
        available_words = [word for word in group if word in wv.key_to_index]
        if len(available_words) >= 3:
            try:
                odd_word = wv.doesnt_match(available_words)
                print(f"In {available_words}, the odd one is: {odd_word}")
            except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
                print(f"Could not find odd word in {available_words}")
        else:
            print(f"Not enough words from {group} in vocabulary")

# Explore relationships
explore_word_relationships(w2v_model)
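
The odd-one-out idea can be sketched without a trained model: normalise each vector, average the group, and pick the word least similar to that average. To my understanding gensim's `doesnt_match` follows the same principle, but the code below is an illustrative reimplementation on invented two-dimensional vectors, with hypothetical helper names, not gensim's own code.

```python
import math

def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(units[w][k] for w in words) / len(words) for k in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical 2-dimensional vectors, invented for illustration
toy = {
    "hund": [0.9, 0.1],
    "katze": [0.8, 0.2],
    "auto": [0.1, 0.9],
}

odd = odd_one_out(toy, ["hund", "katze", "auto"])
print(f"Odd one out: {odd}")
```

With a corpus as small as the one in this notebook, results from the trained model are likely to be noisy, so an unexpected answer there does not necessarily indicate a bug.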

### Step 5: Visualizing Word Embeddings

In [None]:
def visualize_embeddings(model, words_to_plot=None, method='tsne'):
    """
    Visualize word embeddings in 2D space.
    
    Args:
        model: Trained Word2Vec model
        words_to_plot (list): Specific words to visualize
        method (str): Dimensionality reduction method ('tsne' or 'pca')
    """
    # TODO: Create 2D visualization of word embeddings:
    # 1. Select representative words for visualization
    # 2. Apply dimensionality reduction (t-SNE or PCA)
    # 3. Create scatter plot with word labels
    # 4. Color-code by semantic categories if possible
    
    wv = model.wv
    
    # Select words to plot
    if words_to_plot is None:
        # Select most frequent words
        words_to_plot = list(wv.key_to_index.keys())[:30]  # Top 30 words
    
    # Filter words that exist in vocabulary
    available_words = [word for word in words_to_plot if word in wv.key_to_index]
    
    if len(available_words) < 5:
        print("Not enough words available for visualization")
        return
    
    print(f"Visualizing {len(available_words)} words using {method.upper()}")
    
    # Get word vectors
    word_vectors = [wv[word] for word in available_words]
    word_vectors = np.array(word_vectors)
    
    # Apply dimensionality reduction
    if method.lower() == 'tsne':
        reducer = TSNE(n_components=2, random_state=42, perplexity=min(30, len(available_words)-1))
    else:
        reducer = PCA(n_components=2, random_state=42)
    
    word_vectors_2d = reducer.fit_transform(word_vectors)
    
    # Create visualization
    plt.figure(figsize=(12, 10))
    
    # Define semantic categories for coloring
    categories = {
        'animals': ['hund', 'katze', 'vogel', 'tier', 'tiere'],
        'technology': ['computer', 'internet', 'technologie', 'software', 'hardware'],
        'transport': ['auto', 'zug', 'flugzeug', 'fahrrad', 'fahren'],
        'family': ['familie', 'mutter', 'vater', 'kind', 'kinder', 'eltern'],
        'education': ['schule', 'lernen', 'lehrer', 'schüler', 'universität', 'buch'],
        'other': []
    }
    
    # Assign categories to words
    word_categories = {}
    for word in available_words:
        assigned = False
        for category, category_words in categories.items():
            if word in category_words:
                word_categories[word] = category
                assigned = True
                break
        if not assigned:
            word_categories[word] = 'other'
    
    # Color map for categories
    colors = ['red', 'blue', 'green', 'orange', 'purple', 'gray']
    category_colors = {cat: colors[i] for i, cat in enumerate(categories.keys())}
    
    # Plot points
    for i, word in enumerate(available_words):
        x, y = word_vectors_2d[i]
        color = category_colors[word_categories[word]]
        plt.scatter(x, y, c=color, alpha=0.7, s=100)
        plt.annotate(word, (x, y), xytext=(5, 5), textcoords='offset points', 
                    fontsize=10, alpha=0.8)
    
    # Create legend
    legend_elements = [plt.Line2D([0], [0], marker='o', color='w', 
                                 markerfacecolor=category_colors[cat], 
                                 markersize=10, label=cat.capitalize())
                      for cat in categories.keys()]
    plt.legend(handles=legend_elements, loc='best')
    
    plt.title(f'Word Embeddings Visualization ({method.upper()})')
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Visualize embeddings
visualize_embeddings(w2v_model, method='tsne')

### Step 6: Clustering Words by Similarity

In [None]:
def cluster_word_embeddings(model, n_clusters=5, words_to_cluster=None):
    """
    Cluster words based on their embedding similarity.
    
    Args:
        model: Trained Word2Vec model
        n_clusters (int): Number of clusters
        words_to_cluster (list): Specific words to cluster
    
    Returns:
        dict: Clustering results
    """
    # TODO: Implement word clustering:
    # 1. Select words for clustering
    # 2. Apply K-means clustering
    # 3. Analyze cluster composition
    # 4. Visualize clusters
    
    wv = model.wv
    
    # Select words to cluster
    if words_to_cluster is None:
        words_to_cluster = list(wv.key_to_index.keys())[:30]  # Top 30 words
    
    # Filter available words
    available_words = [word for word in words_to_cluster if word in wv.key_to_index]
    
    if len(available_words) < n_clusters:
        print(f"Not enough words for {n_clusters} clusters")
        return None
    
    print(f"Clustering {len(available_words)} words into {n_clusters} clusters")
    
    # Get word vectors
    word_vectors = np.array([wv[word] for word in available_words])
    
    # Apply K-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(word_vectors)
    
    # Organize results by cluster
    clusters = {i: [] for i in range(n_clusters)}
    for word, label in zip(available_words, cluster_labels):
        clusters[label].append(word)
    
    # Display clusters
    print("\nClustering Results:")
    print("=" * 40)
    for cluster_id, words in clusters.items():
        print(f"Cluster {cluster_id}: {', '.join(words)}")
    
    # Visualize clusters
    # Reduce dimensionality for visualization
    pca = PCA(n_components=2, random_state=42)
    word_vectors_2d = pca.fit_transform(word_vectors)
    
    plt.figure(figsize=(12, 8))
    colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown', 'pink', 'gray']
    
    for i, (word, label) in enumerate(zip(available_words, cluster_labels)):
        x, y = word_vectors_2d[i]
        color = colors[label % len(colors)]
        plt.scatter(x, y, c=color, alpha=0.7, s=100)
        plt.annotate(word, (x, y), xytext=(5, 5), textcoords='offset points', 
                    fontsize=10, alpha=0.8)
    
    plt.title('Word Embedding Clusters')
    plt.xlabel('PCA Dimension 1')
    plt.ylabel('PCA Dimension 2')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    return clusters

# Cluster word embeddings
clusters = cluster_word_embeddings(w2v_model, n_clusters=4)

### Step 7: Comparing Different Embedding Methods

In [None]:
def compare_embedding_methods(corpus):
    """
    Compare different embedding methods on the same corpus.
    
    Args:
        corpus (list): Tokenized sentences
    
    Returns:
        dict: Comparison results
    """
    # TODO: Train and compare different embedding methods:
    # 1. Word2Vec CBOW vs Skip-gram
    # 2. FastText (handles subwords)
    # 3. Different vector dimensions
    # 4. Compare performance on similarity tasks
    
    print("Comparing Different Embedding Methods:")
    print("=" * 50)
    
    methods = {}
    
    # Word2Vec CBOW
    print("Training Word2Vec CBOW...")
    w2v_cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, 
                        sg=0, epochs=20, seed=42)  # sg=0 for CBOW
    methods['Word2Vec CBOW'] = w2v_cbow
    
    # Word2Vec Skip-gram
    print("Training Word2Vec Skip-gram...")
    w2v_skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, 
                           sg=1, epochs=20, seed=42)  # sg=1 for Skip-gram
    methods['Word2Vec Skip-gram'] = w2v_skipgram
    
    # FastText
    print("Training FastText...")
    fasttext_model = FastText(corpus, vector_size=100, window=5, min_count=1, 
                             epochs=20, seed=42)
    methods['FastText'] = fasttext_model
    
    # Compare on similarity tasks
    test_pairs = [
        ("hund", "katze"),
        ("auto", "fahrrad"),
        ("computer", "technologie"),
        ("lernen", "schule")
    ]
    
    print("\nSimilarity Comparison:")
    print("-" * 30)
    
    results = {}
    for method_name, model in methods.items():
        print(f"\n{method_name}:")
        similarities = []
        
        for word1, word2 in test_pairs:
            if word1 in model.wv.key_to_index and word2 in model.wv.key_to_index:
                sim = model.wv.similarity(word1, word2)
                similarities.append(sim)
                print(f"  {word1}-{word2}: {sim:.3f}")
            else:
                print(f"  {word1}-{word2}: Words not in vocabulary")
        
        results[method_name] = {
            'model': model,
            'vocab_size': len(model.wv.key_to_index),
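            # NaN (not 0) when no test pair is in the vocabulary, so "no data" is not read as "dissimilar"
            'avg_similarity': np.mean(similarities) if similarities else float('nan')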
        }
    
    # Summary comparison
    print("\n" + "=" * 50)
    print("Summary Comparison:")
    for method_name, result in results.items():
        print(f"{method_name}:")
        print(f"  Vocabulary size: {result['vocab_size']}")
        print(f"  Average similarity: {result['avg_similarity']:.3f}")
    
    return results

# Compare different methods
comparison_results = compare_embedding_methods(training_corpus)
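
The TODO above notes that FastText "handles subwords". It does so by building each word vector from character n-grams, which is what lets it relate an unseen German compound to words it did see. The sketch below is a plain-Python illustration of that n-gram extraction, not gensim's internal implementation. The function name `char_ngrams` and the example words are hypothetical; the 3 to 6 range mirrors gensim's default `min_n` and `max_n` settings.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, FastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce their own distinct n-grams.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Hypothetical example: a compound that may be missing from a small corpus
known = set(char_ngrams("haustier"))
unseen = set(char_ngrams("haustierarzt"))
shared = known & unseen

print(f"n-grams in 'haustier':     {len(known)}")
print(f"n-grams in 'haustierarzt': {len(unseen)}")
print(f"shared n-grams:            {len(shared)}")
print(sorted(shared)[:5])
```

Note that the vocabulary check in `compare_embedding_methods` uses `key_to_index`, so it skips out-of-vocabulary pairs for FastText as well, even though a FastText model should still be able to return a vector for them through shared n-grams.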

### Step 8: Practical Application - Document Similarity

In [None]:
def document_similarity_with_embeddings(model, documents):
    """
    Calculate document similarity using word embeddings.
    
    Args:
        model: Trained Word2Vec model
        documents (list): List of documents (strings)
    
    Returns:
        numpy.ndarray: Pairwise cosine similarity matrix of shape
            (n_documents, n_documents)
    """
    # TODO: Implement document similarity using embeddings:
    # 1. Convert documents to word vectors
    # 2. Aggregate words to document vectors (average, weighted average)
    # 3. Calculate pairwise document similarities
    # 4. Compare with traditional methods (TF-IDF)
    
    print("Calculating Document Similarity using Word Embeddings:")
    print("=" * 55)
    
    wv = model.wv
    
    def document_to_vector(doc_text, method='average'):
        """Convert document to vector representation."""
        words = simple_preprocess(doc_text)
        word_vectors = []
        
        for word in words:
            if word in wv.key_to_index:
                word_vectors.append(wv[word])
        
        # No in-vocabulary words: fall back to a zero vector
        if not word_vectors:
            return np.zeros(wv.vector_size)
        
        # Aggregate word vectors; any method other than 'sum' uses the mean
        if method == 'sum':
            return np.sum(word_vectors, axis=0)
        return np.mean(word_vectors, axis=0)
    
    # Convert documents to vectors
    doc_vectors = []
    for i, doc in enumerate(documents):
        doc_vec = document_to_vector(doc)
        doc_vectors.append(doc_vec)
        print(f"Document {i+1}: {len(simple_preprocess(doc))} words -> vector shape {doc_vec.shape}")
    
    doc_vectors = np.array(doc_vectors)
    
    # Calculate similarity matrix
    similarity_matrix = cosine_similarity(doc_vectors)
    
    print("\nDocument Similarity Matrix:")
    print("-" * 30)
    
    # Display similarity matrix
    for i in range(len(documents)):
        for j in range(len(documents)):
            print(f"{similarity_matrix[i][j]:.3f}", end="  ")
        print()
    
    # Rank all document pairs from most to least similar
    print("\nMost Similar Document Pairs:")
    print("-" * 35)
    
    pairs = [(similarity_matrix[i][j], i, j)
             for i in range(len(documents))
             for j in range(i + 1, len(documents))]
    for similarity, i, j in sorted(pairs, reverse=True):
        print(f"Doc {i+1} <-> Doc {j+1}: {similarity:.3f}")
    
    return similarity_matrix

# Test documents
test_documents = [
    "Computer und Technologie sind sehr wichtig für die moderne Arbeit.",
    "Hunde und Katzen sind beliebte Haustiere in deutschen Familien.",
    "Das Internet und Software verändern unser Leben täglich.",
    "Tiere wie Vögel und Fische leben in der freien Natur.",
    "Autos und Züge sind wichtige Verkehrsmittel für den Transport."
]

print("Test Documents:")
for i, doc in enumerate(test_documents):
    print(f"{i+1}. {doc}")
print()

# Calculate document similarities
doc_similarities = document_similarity_with_embeddings(w2v_model, test_documents)
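
The TODO in `document_similarity_with_embeddings` mentions a weighted average, but the function only implements the plain mean and sum. One possible weighting, sketched below, scales each word vector by its inverse document frequency so that words appearing in every document (such as "und") contribute less. The helper names `compute_idf` and `weighted_doc_vector` are hypothetical, and the 2-D vectors are made up for illustration; this is one option among several, not the intended solution.

```python
import math
from collections import Counter

import numpy as np


def compute_idf(tokenized_docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_doc_vector(tokens, vectors, idf, dim):
    """IDF-weighted mean of the vectors of all in-vocabulary tokens."""
    known = [tok for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    weights = np.array([idf.get(tok, 1.0) for tok in known])
    stacked = np.array([vectors[tok] for tok in known])
    return np.average(stacked, axis=0, weights=weights)


# Toy 2-D vectors, invented for illustration only
toy_vectors = {
    "hund": np.array([1.0, 0.0]),
    "katze": np.array([0.9, 0.1]),
    "und": np.array([0.0, 1.0]),
}
toy_docs = [["hund", "und", "katze"], ["hund", "und"], ["und"]]

idf = compute_idf(toy_docs)
plain = np.mean([toy_vectors[t] for t in toy_docs[0]], axis=0)
weighted = weighted_doc_vector(toy_docs[0], toy_vectors, idf, dim=2)

print("plain mean:   ", plain)
print("IDF-weighted: ", weighted)
```

In the notebook, `w2v_model.wv` could stand in for `toy_vectors` (it supports both `in` and indexing by word), with `w2v_model.wv.vector_size` as `dim` and the IDF computed over the tokenized documents.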

## Exercise Tasks

Complete the following tasks to deepen your understanding:

1. **Embedding Quality Analysis**:
   - Load larger pre-trained German embeddings (e.g., from deepset.ai)
   - Compare quality on word similarity benchmarks
   - Analyze out-of-vocabulary handling with FastText

2. **Custom Domain Embeddings**:
   - Collect domain-specific German text (news, medical, legal)
   - Train specialized embeddings for your domain
   - Compare with general-purpose embeddings

3. **Embedding Arithmetic**:
   - Explore more complex analogies and relationships
   - Test cultural and linguistic biases in embeddings
   - Implement bias detection and mitigation

4. **Application Development**:
   - Build a semantic search engine using embeddings
   - Create a document clustering system
   - Implement recommendation systems with word embeddings

5. **Evaluation Framework**:
   - Create systematic evaluation metrics
   - Benchmark different embedding methods
   - Develop intrinsic and extrinsic evaluation tasks
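
For the intrinsic evaluation in task 5, a common approach is to check how well the model's similarity scores rank word pairs in the same order as human judgments, using Spearman rank correlation. The sketch below implements that correlation in plain Python so the mechanics are visible. The function names are hypothetical, and the ratings and scores are placeholders rather than values from any real benchmark or trained model.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values, NOT taken from a real benchmark or trained model
human_ratings = [9.0, 7.5, 6.0, 2.0]     # hypothetical human similarity judgments
model_scores = [0.81, 0.55, 0.62, 0.10]  # hypothetical cosine similarities

rho = spearman(human_ratings, model_scores)
print(f"Spearman correlation: {rho:.3f}")
```

With real data, `model_scores` would come from `model.wv.similarity(word1, word2)` for each rated pair, and pairs containing out-of-vocabulary words would need to be skipped or reported separately.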

## Reflection Questions

1. What are the main advantages of dense embeddings over sparse representations?
2. When would you choose Word2Vec CBOW vs Skip-gram?
3. How do characteristics of German, such as compound words, affect embedding quality?
4. What are the limitations of static word embeddings?
5. How can you evaluate embedding quality without labeled data?
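
As a starting point for question 1, the following sketch contrasts sparse one-hot vectors with dense vectors under cosine similarity. The dense values are invented for illustration and do not come from a trained model; the helper `cosine` is a hypothetical stand-in for the similarity functions used earlier in the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vocab = ["hund", "katze", "auto"]

# Sparse one-hot vectors: one dimension per vocabulary word
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense toy vectors, values invented for illustration only
dense = {
    "hund": [0.8, 0.1],
    "katze": [0.7, 0.2],
    "auto": [0.1, 0.9],
}

for name, table in [("one-hot", one_hot), ("dense", dense)]:
    print(f"{name}: hund-katze = {cosine(table['hund'], table['katze']):.3f}, "
          f"hund-auto = {cosine(table['hund'], table['auto']):.3f}")
```

Any two distinct one-hot vectors are orthogonal, so their similarity carries no information about meaning, while dense vectors can place related words closer together.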

## Next Steps

- Study contextual embeddings (BERT, ELMo) in the next topic
- Explore multilingual embeddings for cross-language tasks
- Learn about sentence and document embeddings
- Investigate embedding fine-tuning for specific tasks