# Using Word2Vec Embeddings with RNN

## Overview
This notebook shows how to use **pre-trained Word2Vec embeddings** instead of learning embeddings from scratch.

### Two Approaches:
1. **Learn embeddings** (what we did before): Start random, learn during training
2. **Use Word2Vec** (this notebook): Start with pre-trained knowledge

### Benefits of Word2Vec:
- ✓ Pre-trained on billions of words
- ✓ Already encodes semantic relationships between words
- ✓ Often works better than learned embeddings on small datasets
- ✓ Usually converges faster, because training starts from useful vectors

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Note: In real usage, you'd install gensim and load actual Word2Vec
# pip install gensim
# from gensim.models import KeyedVectors

np.random.seed(42)

## Part 1: Understanding Word2Vec

Word2Vec creates dense vector representations in which **similar words have similar vectors**: words that appear in similar contexts end up close together in the vector space.
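"Similar vectors" is usually measured with cosine similarity. The sketch below is a minimal illustration with hand-picked 3-dimensional toy vectors (an assumption for readability; real Word2Vec vectors have 300 dimensions and different values).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, chosen by hand for illustration only
toy_vectors = {
    'love':  [1.0, 0.8, 0.1],
    'great': [0.9, 0.7, 0.2],
    'hate':  [-1.0, -0.8, 0.1],
}

sim_love_great = cosine_similarity(toy_vectors['love'], toy_vectors['great'])
sim_love_hate = cosine_similarity(toy_vectors['love'], toy_vectors['hate'])
print(f"love vs great: {sim_love_great:.3f}")   # close to +1: same direction
print(f"love vs hate:  {sim_love_hate:.3f}")    # close to -1: opposite direction
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions.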

In [None]:
# Simulating Word2Vec embeddings (normally you'd load pre-trained)
# Real Word2Vec: word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Create simulated Word2Vec embeddings (300-dim)
# In reality, these are pre-trained on billions of words
embedding_dim = 300

# Simulated embeddings with semantic structure
np.random.seed(42)

def simulated_vector(offset, noise_scale=0.05):
    """Small Gaussian noise plus a 'semantic' offset in the first five dimensions."""
    vec = np.random.randn(embedding_dim) * noise_scale
    vec[:len(offset)] += offset
    return vec

# The noise is kept small so that the cluster offsets, not the noise,
# dominate distances between the 300-dim vectors
word2vec_embeddings = {
    # Positive words cluster together
    'love': simulated_vector([1, 0.8, 0, 0, 0]),
    'great': simulated_vector([0.9, 0.7, 0, 0, 0]),
    'amazing': simulated_vector([1.1, 0.9, 0, 0, 0]),
    'excellent': simulated_vector([0.95, 0.85, 0, 0, 0]),
    
    # Negative words cluster together
    'hate': simulated_vector([-1, -0.8, 0, 0, 0]),
    'terrible': simulated_vector([-0.9, -0.7, 0, 0, 0]),
    'awful': simulated_vector([-1.1, -0.9, 0, 0, 0]),
    'bad': simulated_vector([-0.8, -0.6, 0, 0, 0]),
    
    # Neutral/other words
    'I': simulated_vector([0, 0, 0, 0, 0]),
    'the': simulated_vector([0, 0, 0, 0, 0]),
    'NLP': simulated_vector([0, 0, 1, 0, 0]),
    'movie': simulated_vector([0, 0, 0, 1, 0]),
    'bugs': simulated_vector([0, 0, 0, 0, 1]),
}

print("Word2Vec Embedding Dimensions:")
print("="*60)
for word, vec in list(word2vec_embeddings.items())[:3]:
    print(f"{word:10s}: {len(vec)}-dim vector, first 5: {vec[:5]}")
print("...")
print()
print("Key Property: Similar words have similar vectors!")
print("  'love' and 'great' are close in vector space")
print("  'hate' and 'terrible' are close in vector space")
print("  'love' and 'hate' are FAR apart in these simulated vectors")
# Note: in real Word2Vec, antonyms such as 'love' and 'hate' can still be fairly
# similar, because they tend to appear in similar contexts.

## Part 2: Vocabulary and Word-to-Index Mapping
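With real data the vocabulary is derived from the training corpus rather than typed by hand. The following is a minimal sketch of that step, assuming whitespace tokenization; `build_vocab`, `min_count`, and the example sentences are hypothetical names introduced for illustration.

```python
from collections import Counter

def build_vocab(texts, min_count=1, pad_token='[PAD]'):
    """Map each sufficiently frequent word to an integer index; index 0 is padding."""
    counts = Counter(word for text in texts for word in text.split())
    word_to_index = {pad_token: 0}
    # Most frequent words get the smallest indices; ties keep first-seen order
    for word, count in counts.most_common():
        if count >= min_count:
            word_to_index[word] = len(word_to_index)
    return word_to_index

example_texts = ["I love NLP", "I hate bugs", "I love the movie"]
example_vocab = build_vocab(example_texts)
print(example_vocab)
```

Raising `min_count` drops rare words, which keeps the embedding matrix small at the cost of more unknown tokens.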

In [None]:
# Create vocabulary for our task
# In real scenario, this comes from your training data

vocab = {
    '[PAD]': 0,   # Padding token
    'I': 1,
    'love': 2,
    'hate': 3,
    'NLP': 4,
    'the': 5,
    'movie': 6,
    'great': 7,
    'terrible': 8,
    'bugs': 9,
    'amazing': 10,
    'awful': 11,
    'excellent': 12,
    'bad': 13,
}

vocab_size = len(vocab)
index_to_word = {idx: word for word, idx in vocab.items()}

print("Vocabulary:")
print("="*60)
for word, idx in vocab.items():
    print(f"{word:15s} → index {idx}")
print()
print(f"Vocabulary size: {vocab_size}")

## Part 3: Create Embedding Matrix from Word2Vec

**Key step:** Map your vocabulary to Word2Vec vectors
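Once the matrix exists, an embedding layer is essentially a row lookup: token index `i` selects row `i`. A minimal NumPy sketch of that lookup, using a hypothetical 4-word vocabulary and 3-dimensional vectors chosen only for illustration:

```python
import numpy as np

# Hypothetical toy vocabulary and matrix (not the notebook's real values)
toy_vocab = {'[PAD]': 0, 'I': 1, 'love': 2, 'NLP': 3}
toy_matrix = np.array([
    [0.0, 0.0, 0.0],   # row 0: padding
    [0.1, 0.0, 0.2],   # row 1: 'I'
    [1.0, 0.8, 0.1],   # row 2: 'love'
    [0.0, 0.3, 1.0],   # row 3: 'NLP'
])

token_ids = np.array([toy_vocab[w] for w in "I love NLP".split()])
embedded = toy_matrix[token_ids]   # integer indexing selects one row per token

print(token_ids)        # [1 2 3]
print(embedded.shape)   # (3, 3): sequence length x embedding dimension
```

This is why row order matters: the matrix rows must follow the same indices as the vocabulary.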

In [None]:
def create_embedding_matrix(vocab, word2vec_embeddings, embedding_dim):
    """
    Create embedding matrix for Keras Embedding layer
    
    Args:
        vocab: dict mapping word -> index
        word2vec_embeddings: dict mapping word -> vector
        embedding_dim: dimension of embeddings (300 for Word2Vec)
    
    Returns:
        embedding_matrix: numpy array (vocab_size, embedding_dim)
    """
    vocab_size = len(vocab)
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    
    found = 0
    not_found = []
    
    for word, idx in vocab.items():
        if word == '[PAD]':
            # Padding is not a real word: leave its row as zeros
            continue
        if word in word2vec_embeddings:
            # Use pre-trained Word2Vec vector
            embedding_matrix[idx] = word2vec_embeddings[word]
            found += 1
        else:
            # Out-of-vocabulary: use random small vector
            embedding_matrix[idx] = np.random.randn(embedding_dim) * 0.01
            not_found.append(word)
    
    n_words = found + len(not_found)
    print("Embedding Matrix Created:")
    print("="*60)
    print(f"Found in Word2Vec:     {found}/{n_words} words (excluding [PAD])")
    print(f"Not found (random):    {len(not_found)} words")
    if not_found:
        print(f"  OOV words: {not_found}")
    print(f"\nMatrix shape: {embedding_matrix.shape}")
    print(f"  (vocab_size={vocab_size}, embedding_dim={embedding_dim})")
    
    return embedding_matrix

# Create the embedding matrix
embedding_matrix = create_embedding_matrix(vocab, word2vec_embeddings, embedding_dim)

print("\nExample: 'love' embedding (first 10 dimensions):")
print(embedding_matrix[vocab['love']][:10])

## Part 4: Training Data
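Because index 0 is reserved for padding, padded positions can be told apart from real tokens. The models in Part 5 pass `mask_zero=True` for this reason. The sketch below shows the idea in plain Python; `pad_and_mask` is a hypothetical helper, not a Keras function.

```python
def pad_and_mask(token_ids, max_length, pad_index=0):
    """Pad or truncate to max_length and return (padded_ids, mask)."""
    padded = token_ids[:max_length] + [pad_index] * max(0, max_length - len(token_ids))
    # True marks a real token, False marks padding that the RNN should skip
    mask = [tok != pad_index for tok in padded]
    return padded, mask

padded_ids, mask = pad_and_mask([1, 2, 4], max_length=5)
print(padded_ids)  # [1, 2, 4, 0, 0]
print(mask)        # [True, True, True, False, False]
```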

In [None]:
# Sample training data
training_data = [
    ("I love NLP", 1),
    ("I hate bugs", 0),
    ("the movie great", 1),
    ("the movie terrible", 0),
    ("I love the movie", 1),
    ("I hate the movie", 0),
    ("NLP amazing", 1),
    ("bugs awful", 0),
    ("excellent movie", 1),
    ("bad movie", 0),
]

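# Tokenize and pad
def tokenize(text, vocab):
    # Unknown words fall back to index 0, so they are treated like padding.
    # With real data, a dedicated [UNK] index is the more common choice.
    return [vocab.get(word, 0) for word in text.split()]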

max_length = 5

X_train = []
y_train = []

for text, label in training_data:
    tokens = tokenize(text, vocab)
    # Pad to max_length
    if len(tokens) < max_length:
        tokens = tokens + [0] * (max_length - len(tokens))
    else:
        tokens = tokens[:max_length]
    X_train.append(tokens)
    y_train.append(label)

X_train = np.array(X_train)
y_train = np.array(y_train)

print("Training Data:")
print("="*60)
for i, (text, label) in enumerate(training_data):
    sentiment = "POSITIVE" if label == 1 else "NEGATIVE"
    print(f"{text:20s} → {sentiment:8s} | tokens: {X_train[i]}")
    
print(f"\nX_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

## Part 5: Build RNN Models - Comparison

We'll create THREE models to compare:
1. **Learned embeddings** (baseline)
2. **Frozen Word2Vec** (trainable=False)
3. **Fine-tuned Word2Vec** (trainable=True)

In [None]:
try:
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    
    print(f"TensorFlow version: {tf.__version__}")
    
    # Model 1: Learn embeddings from scratch
    print("\nModel 1: LEARNED EMBEDDINGS (baseline)")
    print("="*60)
    
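    model_learned = keras.Sequential([
        # An explicit Input builds the model up front (input_length is
        # deprecated in recent Keras versions), so summary() works in Part 6
        keras.Input(shape=(max_length,)),
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=64,  # Smaller dimension, learned from scratch
            mask_zero=True,
            name='learned_embedding'
        ),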
        layers.SimpleRNN(32, return_sequences=False),
        layers.Dense(16, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ], name='RNN_Learned_Embeddings')
    
    model_learned.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    print("✓ Embedding: 64-dim, learned during training")
    print("✓ Starts: Random vectors")
    print("✓ After training: Task-specific embeddings")
    
    # Model 2: Frozen Word2Vec
    print("\nModel 2: FROZEN WORD2VEC")
    print("="*60)
    
    model_frozen = keras.Sequential([
        # An explicit Input builds the model up front (input_length is
        # deprecated in recent Keras versions)
        keras.Input(shape=(max_length,)),
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            weights=[embedding_matrix],  # Pre-trained!
            trainable=False,  # FROZEN - won't update
            mask_zero=True,
            name='frozen_word2vec'
        ),
        layers.SimpleRNN(32, return_sequences=False),
        layers.Dense(16, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ], name='RNN_Frozen_Word2Vec')
    
    model_frozen.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    print("✓ Embedding: 300-dim, pre-trained Word2Vec")
    print("✓ trainable=False: Embeddings stay FIXED")
    print("✓ Only RNN and Dense layers train")
    
    # Model 3: Fine-tuned Word2Vec
    print("\nModel 3: FINE-TUNED WORD2VEC")
    print("="*60)
    
    model_finetuned = keras.Sequential([
        # An explicit Input builds the model up front (input_length is
        # deprecated in recent Keras versions)
        keras.Input(shape=(max_length,)),
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            weights=[embedding_matrix],  # Pre-trained!
            trainable=True,  # FINE-TUNE - will update
            mask_zero=True,
            name='finetuned_word2vec'
        ),
        layers.SimpleRNN(32, return_sequences=False),
        layers.Dense(16, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ], name='RNN_Finetuned_Word2Vec')
    
    model_finetuned.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    print("✓ Embedding: 300-dim, pre-trained Word2Vec")
    print("✓ trainable=True: Embeddings CAN update")
    print("✓ Starts with Word2Vec, adapts to your task")
    print("✓ ALL layers train")
    
    print("\n" + "="*60)
    print("Models created successfully!")
    
except ImportError:
    print("TensorFlow not available. Showing conceptual code only.")
    print("Install with: pip install tensorflow")

## Part 6: Compare Model Architectures
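The parameter counts reported by `summary()` can be checked by hand. The sketch below assumes the layer sizes used in Part 5 (14-word vocabulary, `SimpleRNN(32)`, `Dense(16)`, `Dense(1)`) and the standard parameter formulas for each layer type; the helper names are hypothetical.

```python
def embedding_params(n_words, dim):
    return n_words * dim                      # one row per vocabulary entry

def simple_rnn_params(input_dim, units):
    return units * (input_dim + units + 1)    # input kernel + recurrent kernel + bias

def dense_params(input_dim, units):
    return units * (input_dim + 1)            # weights + bias

n_words, rnn_units = 14, 32

def total_params(dim):
    return (embedding_params(n_words, dim)
            + simple_rnn_params(dim, rnn_units)
            + dense_params(rnn_units, 16)
            + dense_params(16, 1))

learned_total = total_params(64)
word2vec_total = total_params(300)
frozen_trainable = word2vec_total - embedding_params(n_words, 300)
print(learned_total, word2vec_total, frozen_trainable)   # 4545 15401 11201
```

The 300-dimensional models are larger mostly because the RNN's input kernel grows with the embedding dimension, not because of the embedding table itself.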

In [None]:
try:
    print("MODEL 1: Learned Embeddings")
    print("="*70)
    model_learned.summary()
    
    print("\n\nMODEL 2: Frozen Word2Vec")
    print("="*70)
    model_frozen.summary()
    
    print("\n\nMODEL 3: Fine-tuned Word2Vec")
    print("="*70)
    model_finetuned.summary()
    
    print("\n" + "="*70)
    print("Key Differences:")
    print("-"*70)
    print(f"Learned:    {model_learned.count_params():,} total params (all trainable)")
    print(f"Frozen:     {model_frozen.count_params():,} total params")
    print(f"            Only {model_frozen.count_params() - vocab_size * embedding_dim:,} trainable (RNN + Dense)")
    print(f"Fine-tuned: {model_finetuned.count_params():,} total params (all trainable)")
    
except Exception as exc:
    print(f"Models not available for summary: {exc}")

## Part 7: Simulated Training Comparison

In [None]:
try:
    # Train all three models
    print("Training Models...")
    print("="*70)
    
    epochs = 50
    
    print("\n1. Training Learned Embeddings...")
    history_learned = model_learned.fit(
        X_train, y_train,
        epochs=epochs,
        verbose=0
    )
    print("   ✓ Done")
    
    print("\n2. Training Frozen Word2Vec...")
    history_frozen = model_frozen.fit(
        X_train, y_train,
        epochs=epochs,
        verbose=0
    )
    print("   ✓ Done")
    
    print("\n3. Training Fine-tuned Word2Vec...")
    history_finetuned = model_finetuned.fit(
        X_train, y_train,
        epochs=epochs,
        verbose=0
    )
    print("   ✓ Done")
    
    print("\n" + "="*70)
    print("Final Training Results:")
    print("-"*70)
    print(f"Learned:    Accuracy = {history_learned.history['accuracy'][-1]:.4f}")
    print(f"Frozen:     Accuracy = {history_frozen.history['accuracy'][-1]:.4f}")
    print(f"Fine-tuned: Accuracy = {history_finetuned.history['accuracy'][-1]:.4f}")
    
except Exception as exc:
    print(f"Training not available ({exc}); using placeholder curves instead")
    # Illustrative placeholder histories for the plots below (not measured results)
    history_learned = {'loss': list(np.linspace(0.7, 0.3, 50)), 'accuracy': list(np.linspace(0.5, 0.85, 50))}
    history_frozen = {'loss': list(np.linspace(0.5, 0.2, 50)), 'accuracy': list(np.linspace(0.6, 0.95, 50))}
    history_finetuned = {'loss': list(np.linspace(0.4, 0.15, 50)), 'accuracy': list(np.linspace(0.7, 0.98, 50))}

## Visualization: Training Comparison

In [None]:
def as_metrics(history):
    """Accept either a Keras History object or a plain dict of metric lists."""
    return getattr(history, 'history', history)

# Plain dicts mean Part 7 fell back to placeholder curves
title_suffix = ' (Simulated)' if isinstance(history_learned, dict) else ''

runs = [
    ('Learned', history_learned, '#E74C3C'),
    ('Frozen Word2Vec', history_frozen, '#3498DB'),
    ('Fine-tuned Word2Vec', history_finetuned, '#2ECC71'),
]

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# One panel per metric: loss on the left, accuracy on the right
for ax, metric in zip(axes, ['loss', 'accuracy']):
    for label, history, color in runs:
        ax.plot(as_metrics(history)[metric], label=label, linewidth=2, color=color)
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel(metric.capitalize(), fontsize=12)
    ax.set_title(f'Training {metric.capitalize()} Comparison{title_suffix}',
                 fontsize=14, fontweight='bold')
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('word2vec_rnn_training_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("Visualization saved as 'word2vec_rnn_training_comparison.png'")
print("\nTypical Patterns:")
print("  • Frozen Word2Vec: Often learns fastest (good initial embeddings)")
print("  • Fine-tuned: Best final performance (adapts embeddings to task)")
print("  • Learned: Slowest start (random initialization), may catch up with enough data")

## Part 8: Visualize Embedding Space
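The 2D projection used in this part is a PCA. As a rough sketch of what that involves, the function below centers the data and projects it onto the top principal directions obtained from an SVD. It is an illustration under the assumption of random stand-in vectors; its output should agree with scikit-learn's `PCA` only up to the sign of each component.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components using SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(8, 300))   # stand-in for 8 word vectors
points_2d = pca_project(toy_embeddings)
print(points_2d.shape)   # (8, 2)
```

With only eight points, the projection captures the largest differences among those eight words, so clusters in the plot reflect structure in the chosen words rather than in the whole vocabulary.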

In [None]:
# Visualize Word2Vec embeddings in 2D
words_to_plot = ['love', 'hate', 'great', 'terrible', 'amazing', 'awful', 'excellent', 'bad']
indices = [vocab[word] for word in words_to_plot if word in vocab]
words = [word for word in words_to_plot if word in vocab]

# Get embeddings
embeddings_to_plot = embedding_matrix[indices]

# Reduce to 2D using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings_to_plot)

# Plot
fig, ax = plt.subplots(figsize=(10, 8))

# Color by sentiment
positive_words = ['love', 'great', 'amazing', 'excellent']
negative_words = ['hate', 'terrible', 'awful', 'bad']

colors = ['green' if word in positive_words else 'red' for word in words]

ax.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], 
          c=colors, s=200, alpha=0.6, edgecolors='black', linewidth=2)

for i, word in enumerate(words):
    ax.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]),
               fontsize=12, fontweight='bold', ha='center', va='bottom')

ax.set_xlabel('PCA Component 1', fontsize=12)
ax.set_ylabel('PCA Component 2', fontsize=12)
ax.set_title('Word2Vec Embedding Space (2D Projection)\nGreen=Positive, Red=Negative', 
            fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='green', label='Positive words'),
    Patch(facecolor='red', label='Negative words')
]
ax.legend(handles=legend_elements, loc='best', fontsize=10)

plt.tight_layout()
plt.savefig('word2vec_embedding_space.png', dpi=300, bbox_inches='tight')
plt.show()

print("Visualization saved as 'word2vec_embedding_space.png'")
print("\nNotice:")
print("  • Positive words cluster together (green)")
print("  • Negative words cluster together (red)")
print("  • This semantic structure helps the RNN learn faster!")

## Part 9: Real-World Usage Example
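Before building an embedding matrix from real vectors, it can help to check how much of the task vocabulary the pre-trained model actually covers. The sketch below uses a plain dict as a stand-in for the loaded vectors, on the assumption that the real object supports `word in vectors` membership tests in the same way; `vocabulary_coverage` is a hypothetical helper.

```python
def vocabulary_coverage(vocab, vectors, special_tokens=('[PAD]',)):
    """Return (fraction of vocab words found in `vectors`, list of missing words)."""
    words = [w for w in vocab if w not in special_tokens]
    missing = [w for w in words if w not in vectors]
    covered = (len(words) - len(missing)) / len(words) if words else 0.0
    return covered, missing

# Toy stand-ins, for illustration only
toy_vectors = {'love': [0.1], 'movie': [0.2], 'great': [0.3]}
toy_vocab = {'[PAD]': 0, 'love': 1, 'movie': 2, 'great': 3, 'blockbustery': 4}

coverage, missing = vocabulary_coverage(toy_vocab, toy_vectors)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

Low coverage is a sign that many rows of the matrix will be random, which weakens the benefit of pre-training.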

In [None]:
print("Real-World Code Example: Loading Google's Word2Vec")
print("="*70)
print("""
# Step 1: Install gensim
pip install gensim

# Step 2: Download Google's pre-trained Word2Vec
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM
# GoogleNews-vectors-negative300.bin.gz (~1.5 GB)

# Step 3: Load Word2Vec
from gensim.models import KeyedVectors

word2vec = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz',
    binary=True
)

# Step 4: Explore
print(word2vec['king'])  # 300-dim vector
print(word2vec.most_similar('king'))  # List of (word, cosine similarity) pairs
# e.g. 'queen' and 'monarch' rank among the nearest neighbours

# Step 5: Create embedding matrix (same as we did)
embedding_matrix = create_embedding_matrix(vocab, word2vec, 300)

# Step 6: Use in Keras
model = keras.Sequential([
    keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=300,
        weights=[embedding_matrix],
        trainable=False  # or True for fine-tuning
    ),
    keras.layers.LSTM(128),
    keras.layers.Dense(1, activation='sigmoid')
])
""")
print("="*70)

## Part 10: Decision Guide - When to Use What?

In [None]:
print("Decision Guide: Which Approach to Use?")
print("="*70)
print()

print("1. LEARNED EMBEDDINGS (from scratch)")
print("-"*70)
print("✓ Use when:")
print("  • Large dataset (100K+ examples)")
print("  • Very domain-specific vocabulary (medical, legal)")
print("  • Word meanings differ from general use")
print("  • Have GPU and training time")
print()
print("✗ Avoid when:")
print("  • Small dataset (<10K examples)")
print("  • General domain text")
print("  • Limited resources")
print()

print("2. FROZEN WORD2VEC (trainable=False)")
print("-"*70)
print("✓ Use when:")
print("  • Small dataset (1K-10K examples)")
print("  • General domain (news, reviews, social media)")
print("  • Want fast training")
print("  • Limited compute resources")
print("  • Quick baseline/prototype")
print()
print("✗ Avoid when:")
print("  • Very domain-specific text")
print("  • Many out-of-vocabulary words")
print()

print("3. FINE-TUNED WORD2VEC (trainable=True)")
print("-"*70)
print("✓ Use when:")
print("  • Medium dataset (10K-100K examples)")
print("  • Want best accuracy")
print("  • Have some compute resources")
print("  • Domain is somewhat specialized")
print("  • BEST OF BOTH WORLDS approach")
print()
print("✗ Avoid when:")
print("  • Very small dataset (might overfit)")
print("  • Very limited resources")
print()

print("="*70)
print("\nRECOMMENDED WORKFLOW:")
print("  1. Start with FROZEN Word2Vec (quick baseline)")
print("  2. If accuracy is good → Done! 🎉")
print("  3. If accuracy is poor → Try FINE-TUNING")
print("  4. If still poor → Consider learning from scratch or more data")
print("="*70)

## Summary

### Key Takeaways:

**1. Word2Vec gives RNN a "head start"**
- Pre-trained on billions of words
- Already knows "love" ≈ "great" and "hate" ≈ "terrible"
- Often trains faster and gives better results when labeled data is scarce

**2. Three approaches:**
```python
# Learned (from scratch)
Embedding(vocab_size, 64)  # Random → task-specific

# Frozen (fixed)
Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=False)

# Fine-tuned (adaptable)
Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=True)
```

**3. How to create embedding matrix:**
```python
embedding_matrix = np.zeros((vocab_size, 300))
for word, idx in vocab.items():
    if word in word2vec:
        embedding_matrix[idx] = word2vec[word]
    else:
        embedding_matrix[idx] = np.random.randn(300) * 0.01  # small random vector for OOV words
```

**4. Real usage:**
```python
from gensim.models import KeyedVectors
word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
# Then use in Keras as shown above
```

**5. Best practice:**
- Start with frozen Word2Vec (quick baseline)
- Fine-tune if needed for better accuracy
- Only train from scratch if you have lots of data
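The dataset-size heuristics from the decision guide in Part 10 can be written down as a small function. This is only a rough encoding of those rules of thumb: the thresholds come from the guide, while the function name and the `domain_specific` flag are hypothetical.

```python
def choose_embedding_strategy(n_examples, domain_specific=False):
    """Rough encoding of the rules of thumb above; not a hard rule."""
    if n_examples >= 100_000 and domain_specific:
        return 'learned'       # plenty of data and unusual vocabulary
    if n_examples >= 10_000:
        return 'fine-tuned'    # enough data to adapt Word2Vec safely
    return 'frozen'            # small data: keep pre-trained vectors fixed

for n in (2_000, 50_000, 500_000):
    print(n, choose_embedding_strategy(n, domain_specific=True))
```

In practice the choice is settled by validation accuracy; the function only suggests where to start.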

### Other Pre-trained Options:
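- **GloVe** (Stanford): Similar to Word2Vec, trained on global word co-occurrence counts
- **FastText** (Facebook): Builds vectors from character n-grams, so it handles out-of-vocabulary words better
- **BERT embeddings**: Contextual, so the same word gets a different vector in different sentences (more advanced)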
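FastText copes with unseen words because it represents a word through its character n-grams. The sketch below shows only the n-gram extraction step, assuming the `<` and `>` boundary markers described for FastText; `char_ngrams` is a hypothetical helper, not part of any library.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams('where', 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word still shares many of these n-grams with known words, so a vector can be assembled for it from the n-gram vectors.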

### The Big Picture:
**Word2Vec + RNN** combines the best of both:
- Word2Vec: Rich semantic knowledge
- RNN: Sequential pattern learning
- Result: Better performance, especially with limited data!
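To make the division of labour concrete, the sketch below runs a basic recurrent update, `h_t = tanh(x_t W_x + h_{t-1} W_h + b)`, over a sequence of already-embedded tokens. It is a simplified NumPy illustration with random weights and toy sizes (all assumptions), not the Keras implementation.

```python
import numpy as np

def simple_rnn_forward(embedded, W_x, W_h, b):
    """Apply h_t = tanh(x_t @ W_x + h_prev @ W_h + b) step by step; return the final state."""
    h = np.zeros(W_h.shape[0])
    for x_t in embedded:               # one embedded token per time step
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

rng = np.random.default_rng(0)
toy_dim, toy_units = 4, 3              # toy sizes, far smaller than 300 and 32
embedded_sequence = rng.normal(size=(5, toy_dim))
W_x = rng.normal(size=(toy_dim, toy_units)) * 0.1
W_h = rng.normal(size=(toy_units, toy_units)) * 0.1
b = np.zeros(toy_units)

final_state = simple_rnn_forward(embedded_sequence, W_x, W_h, b)
print(final_state.shape)   # (3,)
```

The embeddings supply what each word means, and the recurrence combines those meanings in order into a single state that the dense layers classify.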