# Word Embeddings in NLP

This notebook explores different types of word embeddings and their applications.

## What you'll learn:
- One-hot encoding vs. dense embeddings
- Word2Vec (CBOW and Skip-gram)
- GloVe embeddings
- Using pre-trained embeddings
- Visualizing word relationships
- Word similarity and analogies

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
import gensim
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## Understanding Word Representations
Let's start by understanding different ways to represent words.

In [None]:
# Sample vocabulary
vocabulary = ['king', 'queen', 'man', 'woman', 'boy', 'girl', 'prince', 'princess']
vocab_size = len(vocabulary)

print(f"Vocabulary: {vocabulary}")
print(f"Vocabulary size: {vocab_size}")

# Create word to index mapping
word_to_idx = {word: i for i, word in enumerate(vocabulary)}
idx_to_word = {i: word for i, word in enumerate(vocabulary)}

print(f"\nWord to index mapping: {word_to_idx}")

In [None]:
# One-hot encoding representation
def create_one_hot(word, word_to_idx, vocab_size):
    """
    Create one-hot vector for a word.
    """
    vector = np.zeros(vocab_size)
    if word in word_to_idx:
        vector[word_to_idx[word]] = 1
    return vector

# Example one-hot vectors
print("One-hot Encoding Examples:")
for word in ['king', 'queen', 'man']:
    one_hot = create_one_hot(word, word_to_idx, vocab_size)
    print(f"{word}: {one_hot}")

# Problems with one-hot encoding
king_vec = create_one_hot('king', word_to_idx, vocab_size)
queen_vec = create_one_hot('queen', word_to_idx, vocab_size)
man_vec = create_one_hot('man', word_to_idx, vocab_size)

# Calculate cosine similarity
king_queen_sim = cosine_similarity([king_vec], [queen_vec])[0][0]
king_man_sim = cosine_similarity([king_vec], [man_vec])[0][0]

print(f"\nCosine similarity (one-hot):")
print(f"King-Queen: {king_queen_sim:.3f}")
print(f"King-Man: {king_man_sim:.3f}")
print("Problem: every pair of distinct words has similarity 0, so one-hot vectors encode no notion of relatedness!")
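
To make the contrast with dense vectors concrete, the sketch below compares one-hot vectors with a small hand-made dense table. The two feature axes (royalty, gender) and every number in `dense` are invented for illustration; they are not learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ['king', 'queen', 'man', 'woman']

# One-hot: one dimension per word, so vector length grows with the vocabulary.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hypothetical dense vectors laid out as [royalty, gender]; values are made up.
dense = {
    'king':  [0.9,  0.9],
    'queen': [0.9, -0.9],
    'man':   [0.1,  0.9],
    'woman': [0.1, -0.9],
}

one_hot_king_queen = cosine(one_hot['king'], one_hot['queen'])
one_hot_king_man = cosine(one_hot['king'], one_hot['man'])
dense_king_man = cosine(dense['king'], dense['man'])
dense_king_woman = cosine(dense['king'], dense['woman'])

print(f"one-hot king-queen: {one_hot_king_queen:.3f}")
print(f"one-hot king-man:   {one_hot_king_man:.3f}")
print(f"dense   king-man:   {dense_king_man:.3f}")
print(f"dense   king-woman: {dense_king_woman:.3f}")
```

With the dense table, the similarity scores differ from pair to pair, which is exactly the information one-hot vectors cannot express.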

## Creating Sample Corpus
We'll create a sample corpus to train our word embeddings.

In [None]:
# Sample sentences about royalty and family
sample_sentences = [
    "the king rules the kingdom with wisdom",
    "the queen is elegant and powerful",
    "a man works hard every day",
    "a woman leads with grace and strength",
    "the boy plays in the garden",
    "the girl studies books diligently",
    "the prince will become king someday",
    "the princess is kind and smart",
    "the king and queen rule together",
    "the man and woman are married",
    "the boy and girl are siblings",
    "the prince and princess are royal",
    "a wise king makes good decisions",
    "a strong queen protects her people",
    "the young man works in the city",
    "the young woman teaches children",
    "the little boy loves to play",
    "the little girl enjoys reading",
    "the brave prince fights dragons",
    "the beautiful princess sings songs"
]

# Tokenize sentences
tokenized_sentences = [sentence.split() for sentence in sample_sentences]

print("Sample corpus:")
for i, sentence in enumerate(tokenized_sentences[:5]):
    print(f"{i+1}. {sentence}")
print(f"... and {len(tokenized_sentences)-5} more sentences")

print(f"\nTotal sentences: {len(tokenized_sentences)}")
print(f"Total tokens: {sum(len(sentence) for sentence in tokenized_sentences)}")

## Training Word2Vec Model
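Word2Vec learns dense vector representations in which words that appear in similar contexts end up with similar vectors. With a corpus this small, the learned similarities are noisy; the cells below are meant to illustrate the API rather than produce meaningful embeddings.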

In [None]:
# Train Word2Vec model
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=50,  # Dimension of embeddings
    window=3,        # Context window size
    min_count=1,     # Minimum word frequency
    workers=1,       # Number of threads
    sg=0            # 0 for CBOW, 1 for Skip-gram
)

print("Word2Vec model trained successfully!")
print(f"Vocabulary size: {len(model.wv.key_to_index)}")
print(f"Vector dimensions: {model.wv.vector_size}")
print(f"Vocabulary: {list(model.wv.key_to_index.keys())}")

In [None]:
# Get word vectors
def get_word_vector(word, model):
    """Get vector for a word if it exists in vocabulary"""
    if word in model.wv.key_to_index:
        return model.wv[word]
    else:
        return None

# Example word vectors
king_vector = get_word_vector('king', model)
queen_vector = get_word_vector('queen', model)

print("Word2Vec Embeddings:")
print(f"King vector (first 10 dims): {king_vector[:10]}")
print(f"Queen vector (first 10 dims): {queen_vector[:10]}")
print(f"Vector shape: {king_vector.shape}")
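
A dense embedding table can be viewed as a matrix with one row per word, and looking up a word is equivalent to multiplying its one-hot vector by that matrix. The sketch below checks this on a random matrix `E`, which is only a stand-in for learned weights; `word_to_row` is a hypothetical mapping used for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['king', 'queen', 'man', 'woman']
word_to_row = {word: i for i, word in enumerate(vocab)}

# Random stand-in for a learned embedding matrix: one row per word, 5 dimensions.
E = rng.normal(size=(len(vocab), 5))

# Direct lookup: pick the row that belongs to the word.
queen_row = E[word_to_row['queen']]

# Equivalent view: multiply the word's one-hot vector by the matrix.
one_hot = np.zeros(len(vocab))
one_hot[word_to_row['queen']] = 1.0
queen_via_matmul = one_hot @ E

print(np.allclose(queen_row, queen_via_matmul))
```

In practice the row lookup is used because it avoids building the one-hot vector at all.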

## Exploring Word Similarities

In [None]:
# Calculate similarities between words
def calculate_similarity(word1, word2, model):
    """Calculate cosine similarity between two words (returns 0.0 if either word is out of vocabulary)"""
    try:
        return model.wv.similarity(word1, word2)
    except KeyError:
        return 0.0

# Word pairs to compare
word_pairs = [
    ('king', 'queen'),
    ('king', 'prince'),
    ('queen', 'princess'),
    ('man', 'woman'),
    ('boy', 'girl'),
    ('king', 'man'),
    ('queen', 'woman'),
    ('prince', 'boy'),
    ('princess', 'girl')
]

print("Word Similarities (Word2Vec):")
print("=" * 40)
for word1, word2 in word_pairs:
    similarity = calculate_similarity(word1, word2, model)
    print(f"{word1:>8} - {word2:<8}: {similarity:.3f}")

# Find most similar words
print("\nMost similar words:")
for word in ['king', 'queen', 'man', 'woman']:
    if word in model.wv.key_to_index:
        similar = model.wv.most_similar(word, topn=3)
        print(f"{word}: {similar}")
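
A `most_similar` query can be thought of as ranking every other word by its cosine similarity to the query word. The sketch below reproduces that idea on a toy table. `toy_vectors` and `nearest_neighbors` are hypothetical names and the numbers are invented, so this is an illustration of the idea, not gensim's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbors(query, vectors, topn=2):
    """Rank every word except the query by cosine similarity to the query."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented three-dimensional vectors, for illustration only.
toy_vectors = {
    'king':   [0.9, 0.8, 0.1],
    'prince': [0.8, 0.9, 0.2],
    'queen':  [0.9, -0.7, 0.1],
    'garden': [0.0, 0.1, 0.9],
}

neighbors = nearest_neighbors('king', toy_vectors)
for word, score in neighbors:
    print(f"{word}: {score:.3f}")
```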

## Word Analogies
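Testing the famous "king - man + woman ≈ queen" analogy. On a corpus this small the results are unlikely to match the textbook answers; the cell below shows the mechanics of the query.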

In [None]:
# Test word analogies
def test_analogy(word1, word2, word3, model, topn=3):
    """
    Test analogy: word1 is to word2 as word3 is to ?
    """
    try:
        result = model.wv.most_similar(positive=[word2, word3], negative=[word1], topn=topn)
        return result
    except KeyError as e:
        return f"Word not in vocabulary: {e}"

# Test analogies: (word1, word2, word3) reads "word1 is to word2 as word3 is to ?"
# and is computed as word2 - word1 + word3
analogies = [
    ('king', 'man', 'queen'),      # man - king + queen = ? (ideally woman)
    ('man', 'king', 'woman'),      # king - man + woman = ? (ideally queen)
    ('prince', 'boy', 'princess'), # boy - prince + princess = ? (ideally girl)
    ('boy', 'prince', 'girl')      # prince - boy + girl = ? (ideally princess)
]

print("Word Analogies:")
print("=" * 50)
for word1, word2, word3 in analogies:
    result = test_analogy(word1, word2, word3, model)
    print(f"{word2} - {word1} + {word3} = {result}")
    print("-" * 30)
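
The arithmetic behind an analogy query can be written out by hand. The sketch below uses invented two-dimensional vectors (`toy`, with hypothetical royalty and gender axes) and a hypothetical helper `solve_analogy`. gensim's `most_similar` works on length-normalized vectors, so treat this as a simplified illustration rather than its exact algorithm.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Invented vectors laid out as [royalty, gender]; not learned from data.
toy = {
    'king':  [0.9,  0.9],
    'queen': [0.9, -0.9],
    'man':   [0.1,  0.9],
    'woman': [0.1, -0.9],
    'boy':   [0.0,  0.5],
}

answer = solve_analogy('man', 'king', 'woman', toy)  # king - man + woman
print(f"king - man + woman -> {answer}")
```

The toy vectors are constructed so that the offset between `man` and `king` equals the offset between `woman` and `queen`; learned embeddings only approximate this.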

## Visualizing Word Embeddings

In [None]:
# Get all word vectors
words = list(model.wv.key_to_index.keys())
vectors = [model.wv[word] for word in words]

# Reduce dimensionality using PCA
pca = PCA(n_components=2)
vectors_2d = pca.fit_transform(vectors)

# Create DataFrame for plotting
df_plot = pd.DataFrame({
    'word': words,
    'x': vectors_2d[:, 0],
    'y': vectors_2d[:, 1]
})

# Plot embeddings
plt.figure(figsize=(12, 8))
plt.scatter(df_plot['x'], df_plot['y'], alpha=0.7, s=100)

# Add word labels
for _, row in df_plot.iterrows():
    plt.annotate(row['word'], (row['x'], row['y']), 
                xytext=(5, 5), textcoords='offset points', fontsize=10)

plt.title('Word Embeddings Visualization (PCA)')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {sum(pca.explained_variance_ratio_):.3f}")
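
For readers curious about what a two-component PCA does to the vectors, here is a minimal NumPy sketch: center the data, take the SVD, and project onto the top two right singular vectors. The random matrix `X` is a stand-in for the embedding matrix, and the signs of the components may differ from scikit-learn's output.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))          # stand-in for 20 word vectors of dimension 50

X_centered = X - X.mean(axis=0)        # PCA operates on mean-centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:2]                    # top two principal directions, shape (2, 50)
X_2d = X_centered @ components.T       # projected coordinates, shape (20, 2)

# Each component's share of the variance is its squared singular value over the total.
explained_variance_ratio = (S ** 2) / np.sum(S ** 2)

print(X_2d.shape)
print(explained_variance_ratio[:2])
```

A low total for the first two ratios means the 2D plot discards most of the structure in the original vectors, which is worth keeping in mind when reading the scatter plot.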

In [None]:
# Alternative visualization with t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
vectors_tsne = tsne.fit_transform(vectors)

df_tsne = pd.DataFrame({
    'word': words,
    'x': vectors_tsne[:, 0],
    'y': vectors_tsne[:, 1]
})

plt.figure(figsize=(12, 8))
plt.scatter(df_tsne['x'], df_tsne['y'], alpha=0.7, s=100, c='red')

for _, row in df_tsne.iterrows():
    plt.annotate(row['word'], (row['x'], row['y']), 
                xytext=(5, 5), textcoords='offset points', fontsize=10)

plt.title('Word Embeddings Visualization (t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.grid(True, alpha=0.3)
plt.show()

## Comparing CBOW vs Skip-gram
Let's train both architectures and compare them.

In [None]:
# Train CBOW model (sg=0)
cbow_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=50,
    window=3,
    min_count=1,
    workers=1,
    sg=0  # CBOW
)

# Train Skip-gram model (sg=1)
skipgram_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=50,
    window=3,
    min_count=1,
    workers=1,
    sg=1  # Skip-gram
)

print("Both models trained successfully!")

# Compare similarities
test_pairs = [('king', 'queen'), ('man', 'woman'), ('boy', 'girl')]

print("\nComparison of CBOW vs Skip-gram:")
print("=" * 50)
print(f"{'Word Pair':<15} {'CBOW':<10} {'Skip-gram':<10}")
print("-" * 40)

for word1, word2 in test_pairs:
    cbow_sim = calculate_similarity(word1, word2, cbow_model)
    skipgram_sim = calculate_similarity(word1, word2, skipgram_model)
    pair = f"{word1}-{word2}"  # Pad the whole pair so columns line up with the header
    print(f"{pair:<15} {cbow_sim:<10.3f} {skipgram_sim:<10.3f}")
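
The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplified illustration: `context_window`, `skipgram_pairs`, and `cbow_examples` are hypothetical helpers, and details of real Word2Vec training such as window shrinking, subsampling, and negative sampling are left out.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding the center token."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context_word) example per context word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, center) example per position."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]

sentence = "the king and queen rule".split()

sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)

print(sg[:3])
print(cb[:2])
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position with the whole context as input, so the same sentence yields more skip-gram examples than CBOW examples.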

## Using Pre-trained Embeddings (Demo)
This section shows how to load and use pre-trained embeddings.

In [None]:
# Note: This would normally load pre-trained embeddings
# For demonstration, we'll show the process

print("Loading Pre-trained Embeddings (Demo):")
print("=" * 40)

# Example of how to load Google's Word2Vec model
print("To load Google's pre-trained Word2Vec model:")
print("1. Download: GoogleNews-vectors-negative300.bin.gz")
print("2. Code: model = KeyedVectors.load_word2vec_format('path/to/file', binary=True)")
print("3. Usage: model['word'] or model.similarity('word1', 'word2')")

print("\nTo load GloVe embeddings:")
print("1. Download: glove.6B.300d.txt")
print("2. Code: model = KeyedVectors.load_word2vec_format('path/to/file', binary=False, no_header=True)  # GloVe files have no header line; no_header needs gensim >= 4.0")

print("\nAdvantages of pre-trained embeddings:")
print("- Trained on large corpora (billions of words)")
print("- Capture general semantic relationships")
print("- Ready to use without training time")
print("- Often work better than custom embeddings on small datasets")
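
GloVe's plain-text files store one word per line followed by its space-separated vector components, with no header line. As a dependency-free illustration of that layout, the sketch below parses such lines into a dictionary. `load_glove_text` is a hypothetical helper and the three-dimensional sample data is invented; for real files, a library loader is usually the better choice.

```python
import io

def load_glove_text(lines):
    """Parse GloVe-style text lines ('word v1 v2 ...') into {word: [floats]}."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Invented sample in the same layout. A real file would be read with
# open(path, encoding='utf-8') instead of io.StringIO.
sample = io.StringIO(
    "king 0.5 0.1 -0.3\n"
    "queen 0.4 0.2 -0.2\n"
)

vectors = load_glove_text(sample)
print(sorted(vectors))
print(len(vectors['king']))
```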

## Creating a Simple Word Embedding Lookup

In [None]:
class SimpleEmbeddingLookup:
    """
    Simple class to handle word embeddings lookup and operations.
    """
    
    def __init__(self, model):
        self.model = model
        self.vocab = set(model.wv.key_to_index.keys())
    
    def get_vector(self, word):
        """Get vector for a word"""
        if word in self.vocab:
            return self.model.wv[word]
        else:
            return None
    
    def similarity(self, word1, word2):
        """Calculate similarity between two words"""
        try:
            return self.model.wv.similarity(word1, word2)
        except KeyError:
            return 0.0
    
    def most_similar(self, word, topn=5):
        """Find most similar words"""
        if word in self.vocab:
            return self.model.wv.most_similar(word, topn=topn)
        else:
            return []
    
    def analogy(self, word1, word2, word3, topn=3):
        """Solve word analogy: word1 is to word2 as word3 is to ?"""
        try:
            return self.model.wv.most_similar(positive=[word2, word3], negative=[word1], topn=topn)
        except KeyError:
            return []
    
    def word_in_vocab(self, word):
        """Check if word is in vocabulary"""
        return word in self.vocab

# Create embedding lookup instance
embedding_lookup = SimpleEmbeddingLookup(model)

# Test the lookup
print("Testing Embedding Lookup:")
print("=" * 30)

# Test similarity
sim = embedding_lookup.similarity('king', 'queen')
print(f"King-Queen similarity: {sim:.3f}")

# Test most similar
similar = embedding_lookup.most_similar('king')
print(f"Most similar to 'king': {similar}")

# Test analogy
analogy_result = embedding_lookup.analogy('king', 'man', 'queen')
print(f"man - king + queen = {analogy_result}")

# Test vocabulary check
print(f"'king' in vocabulary: {embedding_lookup.word_in_vocab('king')}")
print(f"'elephant' in vocabulary: {embedding_lookup.word_in_vocab('elephant')}")

## Practical Applications
Examples of how to use word embeddings in real applications.

In [None]:
# Application 1: Document Similarity using average word embeddings
def document_vector(doc, model):
    """
    Create document vector by averaging word embeddings.
    """
    words = doc.split()
    word_vectors = []
    
    for word in words:
        if word in model.wv.key_to_index:
            word_vectors.append(model.wv[word])
    
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        return np.zeros(model.wv.vector_size)

# Test documents
docs = [
    "the king rules the kingdom",
    "the queen is very wise",
    "the man works hard",
    "the woman is strong"
]

# Create document vectors
doc_vectors = [document_vector(doc, model) for doc in docs]

# Calculate document similarities
print("Document Similarity Matrix:")
print("=" * 40)

similarity_matrix = cosine_similarity(doc_vectors)

for i, doc1 in enumerate(docs):
    for j, doc2 in enumerate(docs):
        if i < j:  # Only show upper triangle
            sim = similarity_matrix[i][j]
            print(f"Doc {i+1} vs Doc {j+1}: {sim:.3f}")
            print(f"  '{doc1}' vs '{doc2}'")
            print()

In [None]:
# Application 2: Word clustering based on embeddings
from sklearn.cluster import KMeans

# Get vectors for clustering
words_to_cluster = ['king', 'queen', 'prince', 'princess', 'man', 'woman', 'boy', 'girl']
vectors_to_cluster = [model.wv[word] for word in words_to_cluster if word in model.wv.key_to_index]
valid_words = [word for word in words_to_cluster if word in model.wv.key_to_index]

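# Perform k-means clustering (n_init is set explicitly so the result does not depend on the scikit-learn version's default)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(vectors_to_cluster)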

# Show clustering results
print("Word Clustering Results:")
print("=" * 30)

cluster_dict = {}
for word, cluster in zip(valid_words, clusters):
    if cluster not in cluster_dict:
        cluster_dict[cluster] = []
    cluster_dict[cluster].append(word)

for cluster_id, cluster_words in cluster_dict.items():  # avoid shadowing the earlier `words` list
    print(f"Cluster {cluster_id}: {cluster_words}")

## Key Takeaways

1. **Dense representations** capture semantic relationships better than one-hot encoding
2. **Word2Vec** learns embeddings by predicting a target word from its context (CBOW) or the context words from a target word (Skip-gram)
3. **Similar words** have similar vector representations
4. **Analogies** can often be approximated with vector arithmetic, given embeddings trained on enough data
5. **Pre-trained embeddings** often work better than training from scratch, especially on small datasets
6. **Applications** include document similarity, clustering, and as features for ML models

## Next Steps

- Explore FastText embeddings (subword information)
- Try contextualized embeddings (BERT, ELMo)
- Use embeddings as features in downstream tasks
- Experiment with different embedding dimensions
- Learn about sentence and document embeddings