# Lab 2: Word Embeddings and Language Models

In this lab, we'll explore dense vector representations of words and build neural language models. Word embeddings capture semantic meaning in a continuous vector space.

## Learning Objectives

By the end of this lab, you will:
- Understand limitations of one-hot encoding
- Implement Word2Vec (Skip-gram)
- Use pre-trained embeddings (GloVe)
- Explore word analogies and relationships
- Build RNN language models
- Generate text with LSTM

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from collections import Counter
import re

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)
tf.random.set_seed(42)

## Part 1: One-Hot Encoding Problems

**One-hot encoding** represents each word as a vector of zeros with a single 1 at the word's index in the vocabulary.

### Problems:
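1. **High dimensionality**: Each vector has as many dimensions as the vocabulary has words
2. **No semantic relationship**: Any two distinct words are orthogonal, so all pairs are equally distant
3. **Sparse**: Every vector is all zeros except for a single 1
4. **Can't generalize**: "cat" and "kitten" are treated as unrelated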

### Solution: Dense Embeddings
- Low-dimensional (50-300 dimensions)
- Captures semantic relationships
- Similar words have similar vectors
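
To make the contrast concrete, here is a minimal sketch comparing storage cost and similarity for the two representations. The vocabulary size, the embedding dimension, and the three hand-written vectors are illustrative assumptions, not trained values.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration
vocab_size = 10_000
embedding_dim = 100

# Number of entries needed to store one vector per word
one_hot_entries = vocab_size * vocab_size    # one-hot: vocab_size dims per word
dense_entries = vocab_size * embedding_dim   # dense: embedding_dim dims per word
print(f"One-hot entries: {one_hot_entries:,}")
print(f"Dense entries:   {dense_entries:,}")

# Hand-written toy vectors (not learned from data)
toy = {
    'cat':    np.array([0.9, 0.8, 0.1]),
    'kitten': np.array([0.8, 0.9, 0.2]),
    'king':   np.array([0.1, 0.2, 0.9]),
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_cat_kitten = cos(toy['cat'], toy['kitten'])
sim_cat_king = cos(toy['cat'], toy['king'])
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs king:   {sim_cat_king:.3f}")
```

Unlike one-hot vectors, dense vectors can place related words close together, so the similarity scores are no longer all zero.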

In [None]:
# Demonstrate one-hot encoding limitations
vocab = ['king', 'queen', 'man', 'woman', 'cat', 'dog']
vocab_size = len(vocab)

# Create one-hot vectors
one_hot = np.eye(vocab_size)

print("One-Hot Encoding:")
for word, vec in zip(vocab, one_hot):
    print(f"{word:10s}: {vec}")

# Calculate cosine similarity
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print("\nCosine Similarities (One-Hot):")
print(f"king vs queen: {cosine_similarity(one_hot[0], one_hot[1]):.3f}")
print(f"king vs cat: {cosine_similarity(one_hot[0], one_hot[4]):.3f}")
print(f"man vs woman: {cosine_similarity(one_hot[2], one_hot[3]):.3f}")
print("\nAll pairs are equally dissimilar (0.0)!")

## Part 2: Word2Vec - Skip-gram Model

**Word2Vec** learns embeddings by training a model to predict words from their neighbors. The Skip-gram variant predicts the context words from a target word.

### Skip-gram:
- Input: Target word
- Output: Context words (surrounding words)
- Learning: Words appearing in similar contexts get similar embeddings
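
The input/output pairing described above can be sketched on a single sentence. The helper name `skipgram_pairs` and the window size of 1 are assumptions made for this illustration.

```python
def skipgram_pairs(tokens, window_size=1):
    """Return (target, context) pairs for every token within window_size."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

example_pairs = skipgram_pairs("the king loves the queen".split(), window_size=1)
for target, context in example_pairs:
    print(f"{target} -> {context}")
```

Each pair becomes one training example: the target word is the input and the context word is the label.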

### Architecture:
1. Input: One-hot encoded word
2. Embedding layer: Projects to dense vector
3. Output: Probability distribution over vocabulary
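
The three architecture steps can be traced by hand in NumPy. This is a sketch with randomly initialized weights and made-up sizes; the names `W_in` and `W_out` are assumptions for illustration, and nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4   # illustrative sizes

W_in = rng.normal(size=(vocab_size, embedding_dim))    # embedding matrix
W_out = rng.normal(size=(embedding_dim, vocab_size))   # output projection

# 1. Input: one-hot encoded target word
target_idx = 2
x = np.zeros(vocab_size)
x[target_idx] = 1.0

# 2. Embedding layer: multiplying by a one-hot vector selects one row of W_in
hidden = x @ W_in
print(np.allclose(hidden, W_in[target_idx]))

# 3. Output: softmax turns the scores into a distribution over the vocabulary
logits = hidden @ W_out
exp_logits = np.exp(logits - logits.max())   # subtract max for numerical stability
probs = exp_logits / exp_logits.sum()
print(probs.round(3), probs.sum())
```

Because step 2 is just a row lookup, frameworks implement the embedding layer as an index operation rather than a matrix product.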

In [None]:
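# Simple Word2Vec (skip-gram) implementation.
# It uses a full softmax over the vocabulary, which is fine for a toy corpus;
# production implementations typically use negative sampling or hierarchical softmax.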
class SimpleWord2Vec:
    def __init__(self, embedding_dim=50, window_size=2):
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.word2idx = {}
        self.idx2word = {}
        self.embeddings = None
    
    def build_vocab(self, sentences):
        """Build vocabulary from sentences."""
        words = []
        for sentence in sentences:
            words.extend(sentence.lower().split())
        
        vocab = sorted(set(words))
        self.word2idx = {word: idx for idx, word in enumerate(vocab)}
        self.idx2word = {idx: word for word, idx in self.word2idx.items()}
        self.vocab_size = len(vocab)
    
    def generate_training_data(self, sentences):
        """Generate skip-gram training pairs."""
        pairs = []
        
        for sentence in sentences:
            words = sentence.lower().split()
            word_indices = [self.word2idx[w] for w in words if w in self.word2idx]
            
            for center_idx, center_word in enumerate(word_indices):
                # Get context window
                start = max(0, center_idx - self.window_size)
                end = min(len(word_indices), center_idx + self.window_size + 1)
                
                for context_idx in range(start, end):
                    if context_idx != center_idx:
                        pairs.append((center_word, word_indices[context_idx]))
        
        return np.array(pairs)
    
    def train(self, sentences, epochs=10):
        """Train Word2Vec model."""
        self.build_vocab(sentences)
        training_data = self.generate_training_data(sentences)
        
        # Build model
        input_target = layers.Input(shape=(1,))
        # input_length is deprecated in recent Keras versions and is not needed here
        embedding = layers.Embedding(self.vocab_size, self.embedding_dim,
                                    name='embedding')(input_target)
        embedding = layers.Flatten()(embedding)
        output = layers.Dense(self.vocab_size, activation='softmax')(embedding)
        
        model = keras.Model(inputs=input_target, outputs=output)
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
        
        # Train
        X = training_data[:, 0]
        y = training_data[:, 1]
        
        model.fit(X, y, epochs=epochs, batch_size=32, verbose=0)
        
        # Extract embeddings
        embedding_layer = model.get_layer('embedding')
        self.embeddings = embedding_layer.get_weights()[0]
        
        return model
    
    def get_embedding(self, word):
        """Get embedding for a word."""
        if word not in self.word2idx:
            return None
        return self.embeddings[self.word2idx[word]]
    
    def most_similar(self, word, top_n=5):
        """Find most similar words."""
        if word not in self.word2idx:
            return []
        
        word_vec = self.get_embedding(word)
        similarities = []
        
        for other_word in self.word2idx:
            if other_word != word:
                other_vec = self.get_embedding(other_word)
                sim = cosine_similarity(word_vec, other_vec)
                similarities.append((other_word, sim))
        
        return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n]

In [None]:
# Train Word2Vec on sample corpus
corpus = [
    "the king loves the queen",
    "the queen loves the king",
    "the man loves the woman",
    "the woman loves the man",
    "the cat chases the mouse",
    "the dog chases the cat",
    "the mouse runs from the cat",
    "the king and queen rule the kingdom",
    "the man and woman walk together",
    "cats and dogs are pets"
]

# Train model
w2v = SimpleWord2Vec(embedding_dim=20, window_size=2)
model = w2v.train(corpus, epochs=50)

print("Word2Vec trained!")
print(f"Vocabulary size: {w2v.vocab_size}")
print(f"Embedding dimension: {w2v.embedding_dim}")

In [None]:
# Test word similarities
test_words = ['king', 'queen', 'man', 'woman', 'cat', 'dog']

print("\nWord Similarities:\n")
for word in test_words:
    if word in w2v.word2idx:
        similar = w2v.most_similar(word, top_n=3)
        print(f"{word:10s}: {[f'{w}({s:.2f})' for w, s in similar]}")

# Test specific similarities
print("\n\nSpecific Similarities:")
pairs = [('king', 'queen'), ('man', 'woman'), ('cat', 'dog')]
for w1, w2 in pairs:
    if w1 in w2v.word2idx and w2 in w2v.word2idx:
        v1 = w2v.get_embedding(w1)
        v2 = w2v.get_embedding(w2)
        sim = cosine_similarity(v1, v2)
        print(f"{w1} <-> {w2}: {sim:.3f}")

In [None]:
# Visualize embeddings with PCA
if w2v.embeddings is not None:
    # Reduce to 2D
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(w2v.embeddings)
    
    # Plot
    plt.figure(figsize=(12, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.5)
    
    # Label words
    for idx, word in w2v.idx2word.items():
        plt.annotate(word, (embeddings_2d[idx, 0], embeddings_2d[idx, 1]),
                    fontsize=12, alpha=0.8)
    
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.title('Word Embeddings Visualization (PCA)')
    plt.grid(True, alpha=0.3)
    plt.show()
    
    print("Words that share contexts should land near each other; "
          "with a corpus this small, expect the clusters to be noisy.")

## Part 3: Word Analogies

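A well-known property of word embeddings is that vector arithmetic can recover analogies:
$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$

This suggests that embeddings capture **semantic relationships**. With a corpus as small as ours, the expected answer may not always come out on top.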
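
To see why the arithmetic can work, consider a hand-built example in which one axis encodes royalty and the other encodes gender. These 2-D vectors are an assumption made purely for illustration; learned embeddings do not have such clean, labeled axes.

```python
import numpy as np

# Hand-built vectors: axis 0 = royalty, axis 1 = gender (+1 male, -1 female)
toy_vectors = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
}

# Subtracting 'man' removes the male component; adding 'woman' adds the female one
result = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']
print(result)
print(np.array_equal(result, toy_vectors['queen']))
```

In this idealized setting the result lands exactly on `queen`; with learned embeddings the result is only approximately equal, which is why the nearest neighbor is used.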

In [None]:
def solve_analogy(w2v, a, b, c, top_n=5):
    """
    Solve analogy: a is to b as c is to ?
    Using vector arithmetic: b - a + c
    """
    words = [a, b, c]
    if not all(w in w2v.word2idx for w in words):
        return None
    
    # Get vectors
    va = w2v.get_embedding(a)
    vb = w2v.get_embedding(b)
    vc = w2v.get_embedding(c)
    
    # Compute target vector
    target = vb - va + vc
    
    # Find closest words
    similarities = []
    for word in w2v.word2idx:
        if word not in [a, b, c]:
            vec = w2v.get_embedding(word)
            sim = cosine_similarity(target, vec)
            similarities.append((word, sim))
    
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n]

# Test analogies. Each tuple is (a, b, c) and solve_analogy computes b - a + c,
# i.e. "a is to b as c is to ?"
print("Word Analogies:\n")

analogies = [
    ('man', 'king', 'woman'),  # king - man + woman = ? (expected: queen)
    ('king', 'queen', 'man'),  # queen - king + man = ? (expected: woman)
]

for a, b, c in analogies:
    result = solve_analogy(w2v, a, b, c, top_n=3)
    if result:
        print(f"{b} - {a} + {c} =")
        for word, score in result:
            print(f"  {word}: {score:.3f}")
        print()

## Part 4: RNN Language Model

A **language model** predicts the next word given the previous words.

### RNN approach:
- Input: Sequence of words (embeddings)
- Hidden state: Captures context
- Output: Probability distribution over next word
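
The three bullets above can be sketched as a forward pass in NumPy. This is a vanilla RNN with random, untrained weights and made-up sizes (the lab itself trains an LSTM in Keras); the weight names `E`, `W_xh`, `W_hh`, and `W_hy` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 5, 3, 4   # illustrative sizes

E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))      # word embeddings
W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))  # hidden -> output

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

word_ids = [0, 3, 1]          # a hypothetical 3-word input sequence
h = np.zeros(hidden_dim)      # initial hidden state

for idx in word_ids:
    # The new hidden state mixes the current word with the previous context
    h = np.tanh(E[idx] @ W_xh + h @ W_hh)

# Distribution over the next word, computed from the final hidden state
next_word_probs = softmax(h @ W_hy)
print(next_word_probs.round(3))
```

An LSTM replaces the single `tanh` update with gated updates, but the overall flow (embed, update state, project to vocabulary) is the same.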

In [None]:
# Prepare data for language modeling
text = " ".join(corpus)
words = text.lower().split()

# Build vocabulary
vocab = sorted(set(words))
word2idx_lm = {word: idx for idx, word in enumerate(vocab)}
idx2word_lm = {idx: word for word, idx in word2idx_lm.items()}
vocab_size_lm = len(vocab)

print(f"Vocabulary size: {vocab_size_lm}")
print(f"Total words: {len(words)}")

In [None]:
# Create training sequences
seq_length = 3
X_seq, y_seq = [], []

for i in range(len(words) - seq_length):
    seq_in = words[i:i+seq_length]
    seq_out = words[i+seq_length]
    
    X_seq.append([word2idx_lm[w] for w in seq_in])
    y_seq.append(word2idx_lm[seq_out])

X_seq = np.array(X_seq)
y_seq = np.array(y_seq)

print(f"Training sequences: {len(X_seq)}")
print(f"Sequence shape: {X_seq.shape}")

In [None]:
# Build LSTM language model
model_lm = keras.Sequential([
    layers.Embedding(vocab_size_lm, 32),  # input length is inferred from the data
    layers.LSTM(64, return_sequences=False),
    layers.Dense(vocab_size_lm, activation='softmax')
])

model_lm.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train
history = model_lm.fit(
    X_seq, y_seq,
    epochs=50,
    batch_size=16,
    verbose=0
)

print("Language model trained!")
print(f"Final accuracy: {history.history['accuracy'][-1]:.3f}")

In [None]:
# Generate text
def generate_text(model, start_seq, length=10, temperature=1.0):
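    """
    Generate text using the language model.

    start_seq: list of seed words (at least seq_length, all in the vocabulary).
    length: number of words to generate.
    temperature: values below 1.0 make sampling more conservative,
        values above 1.0 make it more random.
    """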
    generated = start_seq.copy()
    
    for _ in range(length):
        # Prepare input
        seq = generated[-seq_length:]
        seq_encoded = np.array([[word2idx_lm[w] for w in seq]])
        
        # Predict next word
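        # Cast to float64 to limit rounding error before renormalizing for np.random.choice
        predictions = model.predict(seq_encoded, verbose=0)[0].astype('float64')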
        predictions = np.log(predictions + 1e-7) / temperature
        exp_preds = np.exp(predictions)
        predictions = exp_preds / np.sum(exp_preds)
        
        next_idx = np.random.choice(len(predictions), p=predictions)
        next_word = idx2word_lm[next_idx]
        generated.append(next_word)
    
    return ' '.join(generated)

# Test generation
start_sequences = [
    ['the', 'king', 'loves'],
    ['the', 'cat', 'chases'],
    ['the', 'man', 'and']
]

print("\nText Generation:\n")
for start in start_sequences:
    generated = generate_text(model_lm, start, length=7, temperature=0.8)
    print(f"Start: {' '.join(start)}")
    print(f"Generated: {generated}\n")

## Key Takeaways

1. **Word embeddings** capture semantic meaning in dense vectors
2. **Word2Vec** learns by predicting context words
3. **Similar words** have similar embeddings (cosine similarity)
4. **Word analogies** work through vector arithmetic
5. **RNN language models** predict next words given context
6. **LSTMs** handle long-range dependencies better than vanilla RNNs
7. **Temperature** controls generation randomness: lower values sharpen the distribution, higher values flatten it
8. **Embeddings** are the foundation of modern NLP
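
Takeaway 7 can be checked numerically. The sketch below applies the same log, divide, and renormalize steps used in `generate_text` to a made-up distribution; the helper name `apply_temperature` and the probabilities are assumptions for illustration.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-7) / temperature
    exp_logits = np.exp(logits - logits.max())   # subtract max for stability
    return exp_logits / exp_logits.sum()

base = [0.7, 0.2, 0.1]   # hypothetical next-word probabilities
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {np.round(apply_temperature(base, t), 3)}")

sharp = apply_temperature(base, 0.5)
flat = apply_temperature(base, 2.0)
```

A temperature below 1 pushes more probability onto the most likely word, while a temperature above 1 spreads it across the alternatives.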

## Exercises

1. **Larger Corpus**: Train Word2Vec on Wikipedia text
2. **GloVe**: Download and use pre-trained GloVe embeddings
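3. **Subword Embeddings**: Implement character-level or BPE tokenization and learn embeddings for the resulting units
4. **Bidirectional LSTM**: Replace the LSTM with a bidirectional LSTM and compare the results
5. **Perplexity**: Calculate the language model's perplexity
6. **Transfer**: Use the learned embeddings in a downstream task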

## Next Steps

In Lab 3, we'll explore:
- Sequence-to-sequence models
- Attention mechanisms
- Neural machine translation
- Multi-head attention

Great work! You now understand how to represent words as dense vectors.