# Word2Vec Skip-Gram Model: Word Similarity Demonstration

This notebook demonstrates how a Word2Vec Skip-Gram model learns word embeddings and how cosine similarity between those embeddings reflects semantic relatedness. We'll show that contextually related words (Fruit and Apple) score higher than unrelated words (Apple and Truck).

## Section 1: Import Required Libraries

In [29]:
# Import Required Libraries
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

print("Libraries imported successfully!")

Libraries imported successfully!


## Section 2: Create a Toy Dictionary and Training Corpus

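We'll create a toy dictionary with 10 words and prepare sentences for training the Word2Vec model. The training sentences also use three words that are not in the dictionary (Heavy, Fast, Round), so the trained vocabulary ends up with 13 entries rather than 10.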

In [30]:
# Create a toy dictionary with 10 words
toy_dictionary = ["Apple", "Fruit", "Orange", "Banana", "Truck", "Car", "Vehicle", "Red", "Wheel", "Sweet"]

print("Toy Dictionary (10 words):")
print(toy_dictionary)
print(f"\nTotal words: {len(toy_dictionary)}")

# Create training sentences with semantic relationships
# Fruit category - frequent co-occurrence of Fruit with Apple, Orange, Banana
# Vehicle category - frequent co-occurrence of Vehicle with Truck, Car, Wheel
sentences = [
    # FRUIT DOMAIN - Heavy emphasis
    ["Apple", "Fruit"], ["Apple", "Fruit"], ["Apple", "Fruit"], ["Apple", "Fruit"], ["Apple", "Fruit"],
    ["Orange", "Fruit"], ["Orange", "Fruit"], ["Orange", "Fruit"], ["Orange", "Fruit"], ["Orange", "Fruit"],
    ["Banana", "Fruit"], ["Banana", "Fruit"], ["Banana", "Fruit"], ["Banana", "Fruit"], ["Banana", "Fruit"],
    ["Fruit", "Apple"], ["Fruit", "Apple"], ["Fruit", "Apple"], ["Fruit", "Apple"], ["Fruit", "Apple"],
    ["Fruit", "Orange"], ["Fruit", "Orange"], ["Fruit", "Orange"], ["Fruit", "Orange"], ["Fruit", "Orange"],
    ["Fruit", "Banana"], ["Fruit", "Banana"], ["Fruit", "Banana"], ["Fruit", "Banana"], ["Fruit", "Banana"],
    ["Apple", "Orange"], ["Apple", "Orange"], ["Apple", "Orange"],
    ["Orange", "Banana"], ["Orange", "Banana"], ["Orange", "Banana"],
    ["Banana", "Apple"], ["Banana", "Apple"], ["Banana", "Apple"],
    ["Apple", "Sweet"], ["Apple", "Sweet"], ["Orange", "Sweet"], ["Banana", "Sweet"],
    ["Apple", "Red"], ["Apple", "Red"],
    ["Fruit", "Sweet"], ["Fruit", "Sweet"],
    # VEHICLE DOMAIN - Heavy emphasis
    ["Truck", "Vehicle"], ["Truck", "Vehicle"], ["Truck", "Vehicle"], ["Truck", "Vehicle"], ["Truck", "Vehicle"],
    ["Car", "Vehicle"], ["Car", "Vehicle"], ["Car", "Vehicle"], ["Car", "Vehicle"], ["Car", "Vehicle"],
    ["Wheel", "Vehicle"], ["Wheel", "Vehicle"], ["Wheel", "Vehicle"],
    ["Vehicle", "Truck"], ["Vehicle", "Truck"], ["Vehicle", "Truck"], ["Vehicle", "Truck"], ["Vehicle", "Truck"],
    ["Vehicle", "Car"], ["Vehicle", "Car"], ["Vehicle", "Car"], ["Vehicle", "Car"], ["Vehicle", "Car"],
    ["Vehicle", "Wheel"], ["Vehicle", "Wheel"], ["Vehicle", "Wheel"],
    ["Truck", "Car"], ["Truck", "Car"], ["Truck", "Car"],
    ["Car", "Wheel"], ["Car", "Wheel"], ["Car", "Wheel"],
    ["Truck", "Wheel"], ["Truck", "Wheel"],
    ["Truck", "Heavy"], ["Truck", "Heavy"],
    ["Car", "Fast"], ["Car", "Fast"],
    ["Wheel", "Round"], ["Wheel", "Round"],
]

print(f"\nTraining corpus created with {len(sentences)} sentences")
print("\nFirst few sentences:")
for i, sent in enumerate(sentences[:5]):
    print(f"  {i+1}. {' '.join(sent)}")

Toy Dictionary (10 words):
['Apple', 'Fruit', 'Orange', 'Banana', 'Truck', 'Car', 'Vehicle', 'Red', 'Wheel', 'Sweet']

Total words: 10

Training corpus created with 87 sentences

First few sentences:
  1. Apple Fruit
  2. Apple Fruit
  3. Apple Fruit
  4. Apple Fruit
  5. Apple Fruit
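
The corpus above contains a few words that are not listed in `toy_dictionary`. A minimal sketch of how one might detect them is shown below. The helper name `words_outside_dictionary` and the shortened `sample_sentences` list are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def words_outside_dictionary(sentences, dictionary):
    """Return corpus words that are absent from the dictionary, with counts."""
    counts = Counter(word for sentence in sentences for word in sentence)
    known = set(dictionary)
    return {word: n for word, n in counts.items() if word not in known}

# Shortened stand-in for the notebook's corpus (illustrative only).
sample_sentences = [
    ["Apple", "Fruit"], ["Truck", "Vehicle"],
    ["Truck", "Heavy"], ["Truck", "Heavy"],
    ["Car", "Fast"], ["Wheel", "Round"],
]
sample_dictionary = ["Apple", "Fruit", "Orange", "Banana", "Truck",
                     "Car", "Vehicle", "Red", "Wheel", "Sweet"]

extra_words = words_outside_dictionary(sample_sentences, sample_dictionary)
print(extra_words)
```

Run against the full `sentences` list, the same check should report `Heavy`, `Fast`, and `Round`, which accounts for a trained vocabulary of 13 words.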


## Section 3: Train Word2Vec Skip-Gram Model

Now we'll train the Skip-Gram model on the toy corpus. The model learns to represent each word as a dense vector by predicting the words that appear around it, so words that share contexts end up with similar vectors.

In [31]:
# Train Word2Vec Skip-Gram Model
# Parameters:
# - sg=1: Use Skip-Gram model (sg=0 would be CBOW)
# - vector_size: Dimension of word vectors
# - window: Context window size
# - min_count: Minimum frequency of words to consider
# - workers: Number of threads for training
# - epochs: Number of training iterations

model = Word2Vec(
    sentences=sentences,
    sg=1,  # Skip-Gram model
    vector_size=100,  # Embedding dimension (gensim's default)
    window=1,  # Only the immediately adjacent word counts as context
    min_count=1,  # Include all words
    workers=4,  # Number of threads; exact reproducibility requires workers=1
    epochs=1000,  # Many passes, since the corpus is tiny
    seed=42,  # Seeds the RNG; also fix PYTHONHASHSEED for repeatable runs
    negative=10,  # Noise words drawn per training pair (gensim's default is 5)
    alpha=0.025,  # Initial learning rate
    min_alpha=0.0001  # Final learning rate
)

print("Word2Vec Skip-Gram Model trained successfully!")
print("\nModel Configuration:")
print("  - Algorithm: Skip-Gram (sg=1)")
print(f"  - Vector size: {model.vector_size}")
print(f"  - Context window: {model.window}")
print(f"  - Vocabulary size: {len(model.wv)}")
print(f"  - Training epochs: {model.epochs}")

Word2Vec Skip-Gram Model trained successfully!

Model Configuration:
  - Algorithm: Skip-Gram (sg=1)
  - Vector size: 100
  - Context window: 1
  - Vocabulary size: 13
  - Training epochs: 1000
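
To make the role of `window` concrete, here is a simplified sketch of how Skip-Gram turns a sentence into (center, context) training pairs. The function `skipgram_pairs` is a hypothetical helper written for illustration. It leaves out negative sampling and the other details of gensim's actual training loop.

```python
def skipgram_pairs(sentence, window=1):
    """Yield (center, context) pairs for words within `window` positions."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield center, sentence[j]

# A two-word sentence yields one pair in each direction.
pairs = list(skipgram_pairs(["Apple", "Fruit"], window=1))
print(pairs)

# With three words, the middle word sees both neighbours.
longer = list(skipgram_pairs(["Apple", "Fruit", "Sweet"], window=1))
print(longer)
```

Under this simplified view, `["Apple", "Fruit"]` and `["Fruit", "Apple"]` produce the same set of pairs, so the reversed sentences in the corpus mainly act as extra repetitions.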


## Section 4: Generate and Display Word Vectors

Let's extract the word vectors for three key words and print their first 10 dimensions.

In [32]:
# Extract word vectors
fruit_vector = model.wv["Fruit"]
apple_vector = model.wv["Apple"]
truck_vector = model.wv["Truck"]

print("Word Vectors (first 10 dimensions shown):")
print(f"\n'Fruit' vector: {fruit_vector[:10]}...")
print(f"'Apple' vector: {apple_vector[:10]}...")
print(f"'Truck' vector: {truck_vector[:10]}...")

# Display all words in vocabulary
print(f"\n\nAll words in vocabulary ({len(model.wv)} words):")
vocab_words = sorted(model.wv.index_to_key)
print(vocab_words)

Word Vectors (first 10 dimensions shown):

'Fruit' vector: [-0.06848942  0.2757941   0.08895981  0.1646756  -0.06122581  0.05481644
 -0.11838776  0.09595592 -0.03728758 -0.20268774]...
'Apple' vector: [-0.06006607  0.31172842  0.08490056  0.19271123 -0.06986912  0.02529087
 -0.10568954  0.10915803 -0.02660062 -0.21612029]...
'Truck' vector: [ 0.00841906  0.2807333   0.03125701  0.18380485 -0.07754957  0.02665494
 -0.11360139  0.08152642  0.01536979 -0.26797312]...


All words in vocabulary (13 words):
['Apple', 'Banana', 'Car', 'Fast', 'Fruit', 'Heavy', 'Orange', 'Red', 'Round', 'Sweet', 'Truck', 'Vehicle', 'Wheel']


## Section 5: Calculate Vector Similarities

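Now we'll calculate the cosine similarity between the word vectors to measure how semantically related they are. Cosine similarity ranges from -1 to 1, and higher values mean the two vectors point in more similar directions.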

In [33]:
# Calculate cosine similarity
# Reshape vectors for cosine_similarity calculation
fruit_vec_reshaped = fruit_vector.reshape(1, -1)
apple_vec_reshaped = apple_vector.reshape(1, -1)
truck_vec_reshaped = truck_vector.reshape(1, -1)

# Calculate similarities
similarity_fruit_apple = cosine_similarity(fruit_vec_reshaped, apple_vec_reshaped)[0][0]
similarity_apple_truck = cosine_similarity(apple_vec_reshaped, truck_vec_reshaped)[0][0]

print("=" * 60)
print("COSINE SIMILARITY ANALYSIS")
print("=" * 60)
print(f"\nSimilarity between 'Fruit' and 'Apple': {similarity_fruit_apple:.4f}")
print(f"Similarity between 'Apple' and 'Truck': {similarity_apple_truck:.4f}")
print(f"\nDifference: {similarity_fruit_apple - similarity_apple_truck:.4f}")
print("\n" + "=" * 60)

COSINE SIMILARITY ANALYSIS

Similarity between 'Fruit' and 'Apple': 0.9928
Similarity between 'Apple' and 'Truck': 0.9744

Difference: 0.0184
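
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that formula out in plain Python on made-up 2-D vectors. The function name `cosine` is an illustrative choice and the inputs are not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # opposite vectors

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

gensim also exposes `model.wv.similarity("Fruit", "Apple")`, which should return the same quantity without any manual reshaping.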



## Section 6: Compare Word Distances and Verify Results

In [34]:
# Create a comprehensive comparison
comparison_data = {
    'Word Pair': ['Fruit ↔ Apple', 'Apple ↔ Truck'],
    'Cosine Similarity': [similarity_fruit_apple, similarity_apple_truck],
    'Interpretation': ['Related (Semantic)', 'Unrelated (Different Domain)']
}

df_comparison = pd.DataFrame(comparison_data)
print("\nCOMPREHENSIVE WORD SIMILARITY COMPARISON:")
print(df_comparison.to_string(index=False))

# Verification
print("\n" + "=" * 70)
print("VERIFICATION RESULT:")
print("=" * 70)
if similarity_fruit_apple > similarity_apple_truck:
    print("✓ SUCCESS: 'Fruit' and 'Apple' are CLOSER than 'Apple' and 'Truck'")
    print(f"  - Fruit-Apple similarity: {similarity_fruit_apple:.4f}")
    print(f"  - Apple-Truck similarity: {similarity_apple_truck:.4f}")
    print(f"  - Difference: {similarity_fruit_apple - similarity_apple_truck:.4f}")
    print("\nThis is consistent with the Skip-Gram model placing words that share")
    print("contexts closer together, although the margin on this toy corpus is small.")
else:
    print("✗ FAILURE: Results do not show expected relationship")
print("=" * 70)

# Show most similar words for context
print("\n\nMost Similar Words to 'Apple':")
try:
    similar_to_apple = model.wv.most_similar('Apple', topn=5)
    for word, score in similar_to_apple:
        print(f"  - {word}: {score:.4f}")
except KeyError:
    # Raised when the query word is not in the model's vocabulary
    print("  Unable to retrieve similar words")


COMPREHENSIVE WORD SIMILARITY COMPARISON:
    Word Pair  Cosine Similarity               Interpretation
Fruit ↔ Apple           0.992796           Related (Semantic)
Apple ↔ Truck           0.974380 Unrelated (Different Domain)

VERIFICATION RESULT:
✓ SUCCESS: 'Fruit' and 'Apple' are CLOSER than 'Apple' and 'Truck'
  - Fruit-Apple similarity: 0.9928
  - Apple-Truck similarity: 0.9744
  - Difference: 0.0184

This is consistent with the Skip-Gram model placing words that share
contexts closer together, although the margin on this toy corpus is small.


Most Similar Words to 'Apple':
  - Orange: 0.9973
  - Banana: 0.9972
  - Sweet: 0.9962
  - Fruit: 0.9928
  - Red: 0.9926
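
A single pair comparison can be sensitive to noise when all similarities are this close to 1. One possible complement is to compare the average similarity within each domain against the average similarity across domains. The sketch below illustrates the idea on made-up 2-D vectors. The helper names and `stand_in_vectors` are assumptions for illustration, and the numbers it prints say nothing about the trained model.

```python
import math
from itertools import combinations, product

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_within(vectors, group):
    """Average similarity over all unordered pairs inside one group."""
    scores = [cosine(vectors[a], vectors[b]) for a, b in combinations(group, 2)]
    return sum(scores) / len(scores)

def mean_across(vectors, group_a, group_b):
    """Average similarity over all pairs drawn from two different groups."""
    scores = [cosine(vectors[a], vectors[b]) for a, b in product(group_a, group_b)]
    return sum(scores) / len(scores)

# Made-up vectors standing in for real embeddings (illustrative only).
stand_in_vectors = {
    "Apple": [1.0, 0.1], "Orange": [0.9, 0.2],
    "Truck": [0.1, 1.0], "Car": [0.2, 0.9],
}
fruit_words = ["Apple", "Orange"]
vehicle_words = ["Truck", "Car"]

within = (mean_within(stand_in_vectors, fruit_words)
          + mean_within(stand_in_vectors, vehicle_words)) / 2
across = mean_across(stand_in_vectors, fruit_words, vehicle_words)
margin = within - across
print(f"within={within:.3f} across={across:.3f} margin={margin:.3f}")
```

With the trained model, the same helpers could be applied to a dictionary such as `{w: model.wv[w] for w in model.wv.index_to_key}`. A clearly positive margin would support the claim that the two domains are separated.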
