# Module 02: Word Embeddings

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 120 minutes  
**Prerequisites**: [Module 01: Text Preprocessing](01_text_preprocessing.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the concept and mathematics of word embeddings
2. Implement and train Word2Vec (Skip-gram and CBOW) models
3. Use pre-trained GloVe and FastText embeddings
4. Discover semantic relationships (analogies like king-man+woman=queen)
5. Visualize embeddings using t-SNE and PCA
6. Compare different embedding methods and their trade-offs

## What are Word Embeddings?

**Word embeddings** are dense vector representations of words in a continuous vector space, where semantically similar words are mapped to nearby points.

### From Sparse to Dense Representations

**Traditional (One-hot encoding)**:
- "cat" = [0, 0, 1, 0, 0, ..., 0] (vocabulary size = 50,000)
- **Problem**: No notion of similarity, very sparse

**Modern (Word embeddings)**:
- "cat" = [0.2, -0.4, 0.7, ..., 0.1] (typically 100-300 dimensions)
- **Benefit**: Similar words have similar vectors
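
To make the contrast concrete, the sketch below compares cosine similarity under both representations. The five-word vocabulary and the dense vectors are hand-picked values for illustration, not learned embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 5-word vocabulary; each one-hot row has a single 1.
toy_vocab = ["cat", "dog", "car", "mat", "log"]
one_hot = np.eye(len(toy_vocab))

# Hand-picked 3-dimensional dense vectors (illustrative, not trained).
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

onehot_cat_dog = cosine(one_hot[0], one_hot[1])
dense_cat_dog = cosine(dense["cat"], dense["dog"])
dense_cat_car = cosine(dense["cat"], dense["car"])

print(f"one-hot cat vs dog: {onehot_cat_dog:.3f}")
print(f"dense   cat vs dog: {dense_cat_dog:.3f}")
print(f"dense   cat vs car: {dense_cat_car:.3f}")
```

Any two distinct one-hot vectors are orthogonal, so their similarity is always zero; dense vectors can place "cat" closer to "dog" than to "car".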

### The Distributional Hypothesis

> "You shall know a word by the company it keeps" - J.R. Firth (1957)

Words that appear in similar contexts have similar meanings:
- "The cat sits on the mat" ≈ "The dog sits on the mat"
- Therefore: cat ≈ dog (in vector space)

## Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import warnings
warnings.filterwarnings('ignore')

# NLP and embeddings
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec, FastText
from gensim.models import KeyedVectors

# PyTorch for custom implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Dimensionality reduction
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Visualization
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("✓ All libraries imported successfully!")

In [None]:
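# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data used by word_tokenize in newer NLTK releases
nltk.download('brown', quiet=True)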

print("✓ NLTK data downloaded!")

## 1. Understanding Word Embeddings: A Visual Introduction

Let's start by creating a simple example to understand how embeddings capture meaning.

In [None]:
# Simple corpus for demonstration
corpus = [
    "The cat sits on the mat",
    "The dog sits on the log",
    "Cats and dogs are animals",
    "The cat and dog play together",
    "The quick brown fox jumps over the lazy dog",
    "A cat and a fox are different animals",
]

# Tokenize
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

print("Sample corpus:")
for i, tokens in enumerate(tokenized_corpus, 1):
    print(f"{i}. {tokens}")

# Build vocabulary
vocab = set()
for tokens in tokenized_corpus:
    vocab.update(tokens)

print(f"\nVocabulary size: {len(vocab)}")
print(f"Vocabulary: {sorted(vocab)}")

### Co-occurrence Matrix

Before learning about Word2Vec, let's understand co-occurrence: how often words appear together.

In [None]:
def build_cooccurrence_matrix(tokenized_corpus, window_size=2):
    """
    Build word co-occurrence matrix.
    
    Parameters:
    -----------
    tokenized_corpus : list of list of str
        Tokenized sentences
    window_size : int
        Context window size (words before/after)
        
    Returns:
    --------
    pd.DataFrame : Co-occurrence matrix
    """
    vocab = sorted(set([word for tokens in tokenized_corpus for word in tokens]))
    word_to_idx = {word: i for i, word in enumerate(vocab)}
    
    # Initialize matrix
    cooc_matrix = np.zeros((len(vocab), len(vocab)))
    
    # Count co-occurrences
    for tokens in tokenized_corpus:
        for i, word in enumerate(tokens):
            # Get context words within window
            start = max(0, i - window_size)
            end = min(len(tokens), i + window_size + 1)
            
            for j in range(start, end):
                if i != j:
                    word_idx = word_to_idx[word]
                    context_idx = word_to_idx[tokens[j]]
                    cooc_matrix[word_idx, context_idx] += 1
    
    return pd.DataFrame(cooc_matrix, index=vocab, columns=vocab)

# Build co-occurrence matrix
cooc_df = build_cooccurrence_matrix(tokenized_corpus, window_size=2)

# Show subset for key words
key_words = ['cat', 'dog', 'fox', 'animals', 'sits']  # the corpus contains 'animals', not 'animal'
subset = cooc_df.loc[key_words, key_words]

print("Co-occurrence matrix (subset):")
print(subset.astype(int))

In [None]:
# Visualize co-occurrence matrix
plt.figure(figsize=(10, 8))
sns.heatmap(subset, annot=True, fmt='g', cmap='YlOrRd', cbar_kws={'label': 'Co-occurrence count'})
plt.title('Word Co-occurrence Matrix (Window=2)')
plt.xlabel('Context Words')
plt.ylabel('Target Words')
plt.tight_layout()
plt.show()

print("\nObservation: 'cat' and 'dog' appear in similar contexts (high co-occurrence with 'the', 'and')")

## 2. Word2Vec: Skip-gram and CBOW

**Word2Vec** (Mikolov et al., 2013) learns word embeddings by predicting context from words or vice versa.

### Two Architectures:

**1. Skip-gram**: Predict context words from target word
- Input: "cat"
- Output: ["the", "sits", "on", "the"] (context of "cat" in "the cat sits on the mat", window of 3)
- Better for rare words, larger datasets

**2. CBOW (Continuous Bag of Words)**: Predict target word from context
- Input: ["the", "sits", "on", "the"] (context of "cat" in "the cat sits on the mat", window of 3)
- Output: "cat"
- Faster, better for frequent words
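
The difference between the two architectures shows up in how training examples are built from a sentence. The following is a minimal sketch of that step only (no model, no training); the helper names `skipgram_pairs` and `cbow_pairs` are made up for this illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs: one pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window):
    """Return (context_list, target) pairs: one pair per position."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

sentence = "the cat sits on the mat".split()
sg_pairs = skipgram_pairs(sentence, window=2)
cb_pairs = cbow_pairs(sentence, window=2)

# Skip-gram examples whose target is "sits"
print([p for p in sg_pairs if p[0] == "sits"])
# The single CBOW example whose target is "sits"
print(cb_pairs[2])
```

Skip-gram yields one training example per (target, context) pair, while CBOW yields one example per position. That is one reason CBOW trains faster on the same text.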

### Training Objective:

Maximize the probability of observing actual context words given the target word (Skip-gram):

$$\text{maximize} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} | w_t)$$

Where:
- $w_t$ = target word at position t
- $w_{t+j}$ = context word
- $c$ = context window size
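
The objective can be evaluated directly on a toy example. The sketch below assumes the full-softmax form of $p(w_{t+j} \mid w_t)$ from the original Word2Vec formulation, which this notebook does not spell out; the matrices `W_in` and `W_out` are random placeholders, and the token ids are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 6, 4                                   # toy vocabulary size, embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))     # target ("input") vectors
W_out = rng.normal(scale=0.1, size=(V, D))    # context ("output") vectors

def log_softmax(scores):
    """Numerically stable log-softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    return shifted - np.log(np.exp(shifted).sum())

def skipgram_objective(token_ids, c):
    """Sum of log p(w_{t+j} | w_t) over all positions t and offsets j != 0."""
    total = 0.0
    for t, target in enumerate(token_ids):
        log_probs = log_softmax(W_out @ W_in[target])   # log p(. | w_t)
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(token_ids):
                total += log_probs[token_ids[t + j]]
    return total

token_ids = [0, 1, 2, 3, 0, 4]   # hypothetical ids for "the cat sits on the mat"
objective = skipgram_objective(token_ids, c=2)
print(f"objective = {objective:.3f}")
```

Each log-probability is at most 0, so training raises this sum toward its upper bound of 0. With a realistic vocabulary the softmax denominator is expensive, which is why Gensim's default is negative sampling rather than the full softmax.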

### 2.1 Training Word2Vec with Gensim

Let's train Word2Vec on a larger corpus using Gensim.

In [None]:
# Load Brown corpus for training
from nltk.corpus import brown

# Get sentences from Brown corpus (more data for better embeddings)
brown_sentences = brown.sents()[:10000]  # Use first 10,000 sentences

# Lowercase all words
brown_sentences = [[word.lower() for word in sent] for sent in brown_sentences]

print(f"Training corpus: {len(brown_sentences)} sentences")
print(f"Sample sentence: {brown_sentences[0][:15]}")

In [None]:
# Train Skip-gram model
skipgram_model = Word2Vec(
    sentences=brown_sentences,
    vector_size=100,      # Embedding dimension
    window=5,             # Context window size
    min_count=5,          # Ignore words with frequency < 5
    sg=1,                 # 1 = Skip-gram, 0 = CBOW
    workers=4,            # Number of threads (use workers=1 for exactly reproducible runs)
    epochs=10,            # Training epochs
    seed=42
)

print("✓ Skip-gram model trained!")
print(f"Vocabulary size: {len(skipgram_model.wv)}")
print(f"Embedding dimension: {skipgram_model.wv.vector_size}")

In [None]:
# Train CBOW model for comparison
cbow_model = Word2Vec(
    sentences=brown_sentences,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0,  # CBOW
    workers=4,
    epochs=10,
    seed=42
)

print("✓ CBOW model trained!")

### 2.2 Exploring Word Similarities

The magic of word embeddings: finding similar words!

In [None]:
# Find similar words
test_words = ['king', 'computer', 'happy', 'run']

for word in test_words:
    if word in skipgram_model.wv:
        similar = skipgram_model.wv.most_similar(word, topn=5)
        print(f"\nWords similar to '{word}':")
        for sim_word, score in similar:
            print(f"  {sim_word:15} (similarity: {score:.3f})")
    else:
        print(f"\n'{word}' not in vocabulary")

In [None]:
# Compare Skip-gram vs CBOW
test_word = 'good'

if test_word in skipgram_model.wv and test_word in cbow_model.wv:
    sg_similar = skipgram_model.wv.most_similar(test_word, topn=5)
    cbow_similar = cbow_model.wv.most_similar(test_word, topn=5)
    
    print(f"Similar words to '{test_word}':\n")
    print("Skip-gram:".ljust(30) + "CBOW:")
    print("-" * 60)
    for (sg_word, sg_score), (cbow_word, cbow_score) in zip(sg_similar, cbow_similar):
        print(f"{sg_word:15} ({sg_score:.3f})    {cbow_word:15} ({cbow_score:.3f})")

### 2.3 Word Analogies: The Famous King - Man + Woman = Queen

Word embeddings capture semantic relationships through vector arithmetic!
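
As a minimal sketch of what "vector arithmetic" means here, the example below solves one analogy with plain NumPy. The 3-dimensional vectors are invented for illustration (the axes loosely stand for royalty, male, female); real embeddings are learned and their dimensions are not interpretable like this.

```python
import numpy as np

# Hypothetical 3-d vectors; axes loosely mean (royalty, male, female).
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

def analogy(vectors, a, b, c):
    """Rank words by cosine similarity to vectors[a] - vectors[b] + vectors[c]."""
    query = vectors[a] - vectors[b] + vectors[c]
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):        # skip the query words themselves
            continue
        scores[word] = float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = analogy(toy, "king", "man", "woman")
for word, score in ranking:
    print(f"{word:8} {score:.3f}")
```

Gensim's `most_similar` follows the same idea but also normalizes the vectors before combining them, so its scores can differ slightly from raw arithmetic.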

In [None]:
def solve_analogy(model, word_a, word_b, word_c, topn=5):
    """
    Solve word analogy: word_b is to word_a as word_c is to ?
    
    Example: man is to king as woman is to queen
    Formula: king - man + woman ≈ queen (word_a - word_b + word_c)
    """
    try:
        result = model.wv.most_similar(
            positive=[word_a, word_c],  # king + woman
            negative=[word_b],           # - man
            topn=topn
        )
        return result
    except KeyError as e:
        return f"Word not in vocabulary: {e}"

# Test analogies
analogies = [
    ('king', 'man', 'woman'),     # king - man + woman = ?
    ('better', 'good', 'bad'),    # better - good + bad = ? (expect "worse")
    ('paris', 'france', 'italy'), # paris - france + italy = ? (expect "rome")
]

for word_a, word_b, word_c in analogies:
    print(f"\n{word_a} - {word_b} + {word_c} = ?")
    result = solve_analogy(skipgram_model, word_a, word_b, word_c, topn=3)
    if isinstance(result, list):
        for word, score in result:
            print(f"  {word:15} (score: {score:.3f})")
    else:
        print(f"  {result}")

**Exercise 1**: Create and test your own analogies

Come up with 3 word analogies and test them using the Skip-gram model. Try different types:
1. Gender relationships (king/queen, man/woman)
2. Comparative forms (good/better, bad/worse)
3. Geography (country/capital)
4. Tense (walk/walked, run/ran)

In [None]:
# YOUR CODE HERE
# Test your own analogies

my_analogies = [
    # Add your analogies as (word_a, word_b, word_c) tuples
]

# Test them

## 3. GloVe: Global Vectors for Word Representation

**GloVe** (Pennington et al., 2014) combines:
- Global matrix factorization (like LSA)
- Local context window methods (like Word2Vec)

**Key insight**: Ratios of co-occurrence probabilities encode meaning.
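
A small numeric sketch can show why ratios are informative. The counts below are invented purely for illustration (they loosely mirror the ice/steam example used in the GloVe paper, but are not its numbers).

```python
import numpy as np

probes = ["solid", "gas", "water"]
# Invented co-occurrence counts X[w][k], for illustration only.
counts = {
    "ice":   np.array([80.0, 5.0, 300.0]),
    "steam": np.array([5.0, 80.0, 300.0]),
}

# P(k | w) = X_wk / sum_k X_wk  (restricted here to the three probe words)
probs = {w: x / x.sum() for w, x in counts.items()}
ratios = probs["ice"] / probs["steam"]

for probe, ratio in zip(probes, ratios):
    print(f"P({probe} | ice) / P({probe} | steam) = {ratio:.2f}")
```

A ratio far above 1 marks a probe word tied to "ice", a ratio far below 1 marks one tied to "steam", and a ratio near 1 marks a word that does not discriminate between them.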

### Using Pre-trained GloVe Embeddings

Pre-trained embeddings are trained on massive corpora (billions of words) and usually outperform embeddings trained from scratch on small datasets.

In [None]:
# Download GloVe embeddings (this uses a small version)
# In practice, download from: https://nlp.stanford.edu/projects/glove/
# For this demo, we'll use gensim's downloader

import gensim.downloader as api

# Load pre-trained GloVe (this may take a few minutes first time)
print("Loading GloVe embeddings... (this may take a minute)")
glove_model = api.load('glove-wiki-gigaword-100')  # 100-dim GloVe trained on Wikipedia 2014 + Gigaword 5

print("✓ GloVe loaded!")
print(f"Vocabulary size: {len(glove_model)}")
print(f"Embedding dimension: {glove_model.vector_size}")

In [None]:
# Test GloVe on analogies
print("Testing GloVe on word analogies:\n")

# King - Man + Woman = ?
result = glove_model.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=5
)

print("king - man + woman = ?")
for word, score in result:
    print(f"  {word:15} (score: {score:.3f})")

In [None]:
# More GloVe examples
test_words = ['python', 'neural', 'learning', 'beautiful']

for word in test_words:
    similar = glove_model.most_similar(word, topn=5)
    print(f"\nWords similar to '{word}':")
    for sim_word, score in similar:
        print(f"  {sim_word:15} (similarity: {score:.3f})")

**Exercise 2**: Compare Word2Vec and GloVe

For the same set of words, compare the similar words found by:
1. Your trained Skip-gram model
2. Pre-trained GloVe

Discuss:
- Which gives more meaningful similarities?
- Why might GloVe perform better?
- When would you use each?

In [None]:
# YOUR CODE HERE
# Compare Skip-gram and GloVe on the same words

## 4. FastText: Subword Embeddings

**FastText** (Bojanowski et al., 2016) extends Word2Vec by representing words as bags of character n-grams.

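**Example** (character 3-grams, with `<` and `>` marking word boundaries): "running" = ["<ru", "run", "unn", "nni", "nin", "ing", "ng>"], plus the whole word "<running>". By default FastText uses n-grams of length 3 to 6.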

**Advantages**:
- Handles out-of-vocabulary (OOV) words
- Better for morphologically rich languages
- Can generate embeddings for misspelled words

**Example**:
- Word2Vec: "running" seen, "runned" (misspelled) → OOV error
- FastText: "runned" → composed from n-grams such as "<ru", "run", "unn", "ned", "ed>" → valid embedding
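
The composition step can be sketched as an average over n-gram vectors. This is a simplified illustration under stated assumptions: the n-gram vectors are random placeholders, the n-grams are written out by hand, and `compose` is a hypothetical helper. Real FastText hashes n-grams into a fixed number of buckets, so every n-gram maps to some vector.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Pretend these n-gram vectors were learned from in-vocabulary words
# such as "running" and "turned" (random values, for illustration only).
known_ngrams = ["<ru", "run", "unn", "ned", "ed>"]
ngram_vectors = {g: rng.normal(size=dim) for g in known_ngrams}

def compose(ngrams, table):
    """Average the vectors of the n-grams that exist in the table."""
    found = [table[g] for g in ngrams if g in table]
    if not found:
        raise KeyError("no known n-grams for this word")
    return np.mean(found, axis=0)

# Character 3-grams of "<runned>", written out by hand; "nne" is unseen.
runned_ngrams = ["<ru", "run", "unn", "nne", "ned", "ed>"]
runned_vector = compose(runned_ngrams, ngram_vectors)
print(runned_vector.shape)
```

Because "runned" shares most of its n-grams with words seen in training, its composed vector lands near theirs even though the word itself was never observed.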

In [None]:
# Train FastText model
fasttext_model = FastText(
    sentences=brown_sentences,
    vector_size=100,
    window=5,
    min_count=5,
    workers=4,
    sg=1,  # Skip-gram
    epochs=10,
    seed=42
)

print("✓ FastText model trained!")
print(f"Vocabulary size: {len(fasttext_model.wv)}")

In [None]:
# Test FastText on OOV words
# Create misspelled/new words
oov_words = ['runned', 'computering', 'happyness']  # Not in original vocab

print("Testing out-of-vocabulary words:\n")

for word in oov_words:
    # Word2Vec would fail on OOV
    in_w2v = word in skipgram_model.wv
    
    # FastText can handle OOV via subwords
    if not in_w2v:
        try:
            # FastText can generate embedding even for OOV
            embedding = fasttext_model.wv[word]
            similar = fasttext_model.wv.most_similar(word, topn=3)
            
            print(f"'{word}' (OOV):")
            print(f"  Embedding shape: {embedding.shape}")
            print(f"  Similar words:")
            for sim_word, score in similar:
                print(f"    {sim_word:15} ({score:.3f})")
            print()
        except KeyError:
            print(f"'{word}': Could not generate embedding\n")

**Exercise 3**: Subword analysis

1. Create a function to extract character n-grams from a word (like FastText does)
2. Compare embeddings for morphologically related words:
   - "happy", "happier", "happiest", "happiness", "unhappy"
3. Calculate cosine similarities between them
4. Discuss: Do morphologically related words have similar embeddings?

In [None]:
# YOUR CODE HERE
def extract_ngrams(word, n=3):
    """
    Extract character n-grams from a word.
    
    Example: extract_ngrams('cat', n=3) → ['<ca', 'cat', 'at>']
    """
    # Add special boundary markers
    word = f'<{word}>'
    # YOUR CODE HERE
    pass

# Test on morphological variants
variants = ['happy', 'happier', 'happiest', 'happiness', 'unhappy']

## 5. Visualizing Embeddings

Embeddings live in high-dimensional space (typically 100-300 dimensions). We can visualize them in 2D using dimensionality reduction.

### 5.1 t-SNE Visualization

**t-SNE** (t-Distributed Stochastic Neighbor Embedding) preserves local structure, making it great for visualizing clusters.

In [None]:
# Select words to visualize
words_to_plot = [
    # Animals
    'dog', 'cat', 'horse', 'lion', 'tiger',
    # Countries
    'france', 'germany', 'italy', 'spain',
    # Numbers
    'one', 'two', 'three', 'four', 'five',
    # Colors
    'red', 'blue', 'green', 'yellow',
    # Emotions
    'happy', 'sad', 'angry', 'excited'
]

# Filter to words in vocabulary
words_in_vocab = [w for w in words_to_plot if w in glove_model]

# Get embeddings
embeddings = np.array([glove_model[word] for word in words_in_vocab])

print(f"Visualizing {len(words_in_vocab)} words")
print(f"Embedding matrix shape: {embeddings.shape}")

In [None]:
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)

# Plot
plt.figure(figsize=(14, 10))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.7, s=100)

# Annotate points
for i, word in enumerate(words_in_vocab):
    plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                fontsize=12, alpha=0.8)

plt.title('t-SNE Visualization of Word Embeddings (GloVe)', fontsize=16)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Observation: Semantically similar words cluster together!")

### 5.2 PCA Visualization

**PCA** (Principal Component Analysis) preserves global structure and is faster than t-SNE.
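
For intuition about what PCA computes, here is a minimal NumPy sketch that centers the data and projects it onto the top singular directions. It runs on random data shaped like the embedding matrix above; `pca_project` is a made-up helper name, and component signs may differ from scikit-learn's output.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return projected, explained[:n_components]

rng = np.random.default_rng(42)
X = rng.normal(size=(22, 100))      # stand-in for 22 words x 100 dimensions
points_2d, variance_ratio = pca_project(X)

print(points_2d.shape)
print(variance_ratio.round(3))
```

Because the projection is linear and has a closed-form solution, PCA needs no iterative optimization, which is where its speed advantage over t-SNE comes from.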

In [None]:
# Apply PCA
pca = PCA(n_components=2, random_state=42)
embeddings_pca = pca.fit_transform(embeddings)

# Plot
plt.figure(figsize=(14, 10))
plt.scatter(embeddings_pca[:, 0], embeddings_pca[:, 1], alpha=0.7, s=100)

for i, word in enumerate(words_in_vocab):
    plt.annotate(word, (embeddings_pca[i, 0], embeddings_pca[i, 1]),
                fontsize=12, alpha=0.8)

plt.title('PCA Visualization of Word Embeddings (GloVe)', fontsize=16)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**Exercise 4**: Create a semantic visualization

1. Choose a semantic category (e.g., programming languages, foods, sports)
2. Select 20-30 related words
3. Visualize using both t-SNE and PCA
4. Use different colors for subcategories
5. Discuss: Which visualization method works better for your data?

In [None]:
# YOUR CODE HERE
# Create your own semantic visualization

## 6. Embedding Quality and Evaluation

How do we measure if embeddings are good?

### 6.1 Intrinsic Evaluation: Word Similarity

Compare embedding similarities with human judgments using benchmark datasets (e.g., WordSim-353).
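
The idea behind such benchmarks can be sketched in a few lines: score each word pair with the model, then correlate those scores with human ratings. The vectors and ratings below are invented for illustration and are not taken from WordSim-353.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings (not trained).
toy_vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
    "bus": np.array([0.2, 0.1, 0.8]),
}

# Hypothetical human ratings on a 0-10 scale: (word1, word2, rating).
human_pairs = [
    ("cat", "dog", 9.0),
    ("car", "bus", 8.5),
    ("cat", "car", 2.0),
    ("dog", "bus", 1.5),
]

model_scores = [cosine(toy_vectors[a], toy_vectors[b]) for a, b, _ in human_pairs]
human_scores = [rating for _, _, rating in human_pairs]

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")
```

Spearman correlation compares rankings rather than raw values, so it does not matter that cosine similarities and human ratings live on different scales.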

In [None]:
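# Evaluate on the WordSim-353 word similarity benchmark,
# which ships with Gensim's bundled test data
from gensim.test.utils import datapath

pearson, spearman, oov_ratio = glove_model.evaluate_word_pairs(
    datapath('wordsim353.tsv'),
    dummy4unknown=True
)
correlation = pearson[0]  # Pearson correlation coefficient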

print(f"Word similarity correlation: {correlation:.3f}")
print("(Higher is better; > 0.6 is good)")

### 6.2 Analogy Accuracy

Test on standard analogy datasets.

In [None]:
# Create simple analogy test
test_analogies = [
    # Format: (word_a, word_b, word_c, expected_word_d)
    ('man', 'woman', 'king', 'queen'),
    ('man', 'woman', 'boy', 'girl'),
    ('good', 'better', 'bad', 'worse'),
    ('big', 'bigger', 'small', 'smaller'),
]

correct = 0
total = 0

for a, b, c, expected in test_analogies:
    if all(w in glove_model for w in [a, b, c, expected]):
        result = glove_model.most_similar(positive=[c, b], negative=[a], topn=3)
        predicted = result[0][0]
        
        total += 1
        if predicted == expected:
            correct += 1
            status = "✓"
        else:
            status = "✗"
        
        print(f"{status} {a}:{b} :: {c}:{expected} → predicted: {predicted}")

accuracy = correct / total if total > 0 else 0
print(f"\nAnalogy accuracy: {accuracy:.1%} ({correct}/{total})")

## 7. Limitations of Static Word Embeddings

Despite their power, Word2Vec/GloVe/FastText have critical limitations:

### 7.1 No Context Awareness

The word "bank" has the same embedding whether it means:
- Financial institution: "I went to the **bank** to deposit money"
- River edge: "We sat on the river **bank**"

**Problem**: One vector per word, regardless of context!
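
The reason is structural: a static embedding is a lookup table keyed by the word alone. The sketch below uses a hypothetical table of random vectors to show that the surrounding sentence never enters the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical static embedding table: one fixed vector per word type.
words = ["the", "bank", "river", "loan", "approved", "my"]
table = {w: rng.normal(size=4) for w in words}

def embed(sentence, table):
    """Look up each in-table token; the sentence context is never consulted."""
    return {tok: table[tok] for tok in sentence.lower().split() if tok in table}

financial = embed("The bank approved my loan", table)
riverside = embed("The river bank", table)

same = np.array_equal(financial["bank"], riverside["bank"])
print(f"'bank' vectors identical across contexts: {same}")
```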

In [None]:
# Demonstrate the problem
sentences = [
    "The bank approved my loan application",
    "We walked along the river bank at sunset"
]

# Get embedding for 'bank' (same for both contexts!)
if 'bank' in glove_model:
    bank_embedding = glove_model['bank']
    similar_words = glove_model.most_similar('bank', topn=5)
    
    print("The word 'bank' has only ONE embedding, regardless of context:")
    print(f"\nEmbedding shape: {bank_embedding.shape}")
    print(f"\nMost similar words to 'bank':")
    for word, score in similar_words:
        print(f"  {word:15} ({score:.3f})")
    
    print("\n⚠ This single embedding must represent BOTH financial and geographical meanings!")

### 7.2 Other Limitations

1. **Fixed vocabulary**: Word2Vec and GloVe can't produce vectors for words unseen during training (FastText mitigates this with subwords)
2. **Training data bias**: Embeddings inherit biases from training data
3. **No sentence/document representation**: Must aggregate word vectors somehow
4. **Polysemy**: Multiple meanings per word not distinguished
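
For the aggregation point, the most common baseline is to average the word vectors of a sentence. The sketch below uses invented vectors and a hypothetical `sentence_vector` helper; it also shows one cost of averaging, namely that word order is lost.

```python
import numpy as np

# Hypothetical word vectors (not trained).
toy_vectors = {
    "cats":  np.array([0.9, 0.1, 0.0]),
    "chase": np.array([0.0, 0.8, 0.3]),
    "dogs":  np.array([0.8, 0.2, 0.1]),
}

def sentence_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

v1 = sentence_vector("dogs chase cats".split(), toy_vectors, dim=3)
v2 = sentence_vector("cats chase dogs".split(), toy_vectors, dim=3)

print(v1.round(3))
print(f"Same vector for both word orders: {np.allclose(v1, v2)}")
```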

### The Solution: Contextual Embeddings

Modern models (BERT, GPT, etc.) generate **different** embeddings for the same word in different contexts!

We'll learn about these in **Module 07: BERT and Masked Language Modeling**.

**Exercise 5**: Bias exploration

Word embeddings can contain societal biases from training data. Investigate:

1. Test analogies like: "man is to doctor as woman is to ?"
2. Compare similar words for gendered terms: "man", "woman", "he", "she"
3. Test occupation analogies and observe any gender biases
4. Discuss: What are the implications of these biases in real applications?

In [None]:
# YOUR CODE HERE
# Explore potential biases in embeddings

## Summary

### Key Concepts Covered:

1. **Word Embeddings Fundamentals**:
   - Dense vector representations of words
   - Distributional hypothesis: similar contexts → similar meanings
   - Reduces dimensionality while capturing semantics

2. **Word2Vec**:
   - Skip-gram: Predict context from word
   - CBOW: Predict word from context
   - Training objective: Maximize context prediction probability

3. **Other Methods**:
   - GloVe: Global co-occurrence statistics
   - FastText: Subword embeddings for OOV handling

4. **Applications**:
   - Semantic similarity
   - Word analogies (king - man + woman = queen)
   - Clustering and visualization

5. **Limitations**:
   - No context awareness (same word = same embedding)
   - Polysemy problem
   - Potential biases

### Comparison Table:

| Method | Pros | Cons | Best For |
|--------|------|------|----------|
| Word2Vec | Fast, efficient, good analogies | No context, OOV issues | Large datasets |
| GloVe | Captures global statistics | Slower training | Pre-trained use |
| FastText | Handles OOV, morphology | Larger model size | Morphologically rich languages |

### What's Next?

In **Module 03: Recurrent Neural Networks**, we'll learn:
- How to process sequential data (sentences)
- LSTM and GRU architectures
- Sequence classification and generation
- Moving beyond static embeddings

### Additional Resources:

- **Word2Vec Paper**: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)
- **GloVe Paper**: [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)
- **FastText Paper**: [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606)
- **Interactive Demo**: [Embedding Projector](https://projector.tensorflow.org/)