# 🎯 Part 1: Word Embedding Visualization

In this notebook, we'll explore how words are represented as vectors (embeddings) and visualize their relationships.

## What You'll Learn:
1. How to convert words to embeddings using pre-trained models
2. How to visualize word relationships in 3D space using t-SNE
3. How to measure similarity between words using cosine similarity

## 📦 Install Dependencies

In [1]:
# Install required packages (run this in Colab)
!pip install sentence-transformers plotly seaborn scikit-learn -q
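
Version mismatches can cause import problems later on, so it may help to confirm what actually got installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an assumption for illustration and is not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as passed to pip in the cell above
required = ["sentence-transformers", "plotly", "seaborn", "scikit-learn"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as not installed would need the `pip install` cell to be re-run in the same environment as the notebook kernel.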




## 📚 Import Libraries

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

print("✅ All libraries imported successfully!")

  from .autonotebook import tqdm as notebook_tqdm


ImportError: DLL load failed while importing _C: The specified module could not be found.

---
# 🎨 PART 1: 3D Visualization of Word Embeddings

We'll define categories of words and see how they cluster together in embedding space.

## 📝 Define Your Word List

Modify the dictionary below to explore different word relationships!

In [None]:
# ============================================================
# 🔧 CUSTOMIZE YOUR WORD LIST HERE!
# ============================================================
# Group words by category - words in the same category should cluster together

word_categories = {
    "🍎 Fruits": [
        "apple", "banana", "orange", "mango", "strawberry", "grape"
    ],
    "🐾 Animals": [
        "dog", "cat", "elephant", "lion", "tiger", "rabbit"
    ],
    "🎨 Colors": [
        "red", "blue", "green", "yellow", "purple", "orange"
    ],
    "💻 Technology": [
        "computer", "smartphone", "laptop", "tablet", "keyboard", "mouse"
    ],
    "🚗 Vehicles": [
        "car", "bicycle", "motorcycle", "airplane", "train", "boat"
    ]
}

# Flatten the dictionary into lists
words = []
categories = []
for category, word_list in word_categories.items():
    words.extend(word_list)
    categories.extend([category] * len(word_list))

print(f"📊 Total words: {len(words)}")
print(f"📁 Categories: {list(word_categories.keys())}")

## 🤖 Load Embedding Model & Generate Embeddings

In [None]:
# Load a pre-trained sentence transformer model
# This model converts text into 384-dimensional vectors
model_name = "all-MiniLM-L6-v2"
print(f"🔄 Loading model: {model_name}...")

model = SentenceTransformer(model_name)
print(f"✅ Model loaded! Embedding dimension: {model.get_sentence_embedding_dimension()}")

# Generate embeddings for all words
print("\n🔄 Generating embeddings...")
embeddings = model.encode(words, show_progress_bar=True)

print(f"\n✅ Embeddings shape: {embeddings.shape}")
print(f"   - {embeddings.shape[0]} words")
print(f"   - {embeddings.shape[1]} dimensions per word")

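## 📉 Reduce Dimensions with t-SNE

We have 384-dimensional vectors, but we can only plot three dimensions. t-SNE projects the vectors down to 3D while trying to keep each word's nearest neighbors close together. Distances between far-apart clusters in the resulting plot are not reliable, so read the plot for local groupings rather than global geometry.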

In [None]:
# Apply t-SNE to reduce to 3 dimensions
print("🔄 Applying t-SNE (this may take a moment)...")

# Adjust perplexity based on number of samples
perplexity = min(30, len(words) - 1)

tsne = TSNE(
    n_components=3,           # 3D output
    perplexity=perplexity,    # Balance between local and global structure
    random_state=42,          # For reproducibility
    max_iter=1000,            # Number of iterations (named n_iter before scikit-learn 1.5)
    learning_rate='auto',
    init='pca'                # Initialize with PCA for better results
)

embeddings_3d = tsne.fit_transform(embeddings)
print(f"✅ Reduced to shape: {embeddings_3d.shape}")

## 🌐 Interactive 3D Visualization

In [None]:
# Create a DataFrame for Plotly
df = pd.DataFrame({
    'word': words,
    'category': categories,
    'x': embeddings_3d[:, 0],
    'y': embeddings_3d[:, 1],
    'z': embeddings_3d[:, 2]
})

# Create interactive 3D scatter plot
fig = px.scatter_3d(
    df,
    x='x', y='y', z='z',
    color='category',
    text='word',
    title='üìä Word Embeddings in 3D Space (t-SNE)',
    labels={'x': 't-SNE 1', 'y': 't-SNE 2', 'z': 't-SNE 3'},
    height=700,
    color_discrete_sequence=px.colors.qualitative.Set1
)

# Customize the plot
fig.update_traces(
    marker=dict(size=10, line=dict(width=1, color='white')),
    textposition='top center',
    textfont=dict(size=10)
)

fig.update_layout(
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=-0.15,
        xanchor="center",
        x=0.5
    ),
    scene=dict(
        xaxis_title='t-SNE Dimension 1',
        yaxis_title='t-SNE Dimension 2',
        zaxis_title='t-SNE Dimension 3'
    ),
    margin=dict(l=0, r=0, b=100, t=50)
)

fig.show()

print("\n💡 TIP: Drag to rotate, scroll to zoom, double-click to reset view!")

### 🔍 What Do You Notice?

- Do words from the same category cluster together?
- Are there any surprising relationships?
- What happens with words like "orange" (both fruit and color)?
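
The last question can be checked directly. Below is a minimal sketch, assuming a dictionary shaped like `word_categories`; the helper `find_shared_words` and the trimmed-down `demo_categories` are hypothetical names used only for illustration. It lists every word that appears under more than one category. Because the model receives only the bare string, both copies of such a word should get essentially the same embedding, so they are expected to land at (nearly) the same spot in the plot.

```python
from collections import defaultdict


def find_shared_words(word_categories):
    """Map each word listed under 2+ categories to those categories."""
    seen = defaultdict(list)
    for category, word_list in word_categories.items():
        for word in word_list:
            if category not in seen[word]:
                seen[word].append(category)
    return {word: cats for word, cats in seen.items() if len(cats) > 1}


# Trimmed-down stand-in for the notebook's dictionary (illustrative only)
demo_categories = {
    "Fruits": ["apple", "banana", "orange"],
    "Colors": ["red", "blue", "orange"],
}

shared = find_shared_words(demo_categories)
print(shared)
```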

---
# 📊 PART 2: Cosine Similarity Heatmap

Now let's measure how similar each pair of words is using **cosine similarity**.

Cosine similarity = cos(θ), where θ is the angle between two vectors:
- **1.0** = Same direction (most similar)
- **0.0** = Perpendicular (unrelated)  
- **-1.0** = Opposite direction (least similar)
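
To make the three reference values concrete, here is a minimal pure-Python sketch of the formula cos(θ) = (a · b) / (‖a‖ ‖b‖). The function `cosine` and the toy 2D vectors are illustrative assumptions; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors chosen to hit the three reference cases (up to float rounding)
same = cosine([1.0, 2.0], [2.0, 4.0])           # same direction, different length
perpendicular = cosine([1.0, 0.0], [0.0, 3.0])  # right angle
opposite = cosine([1.0, 2.0], [-1.0, -2.0])     # opposite direction

print(f"same: {same:.3f}, perpendicular: {perpendicular:.3f}, opposite: {opposite:.3f}")
```

Note that vector length does not matter: `[1, 2]` and `[2, 4]` point the same way, so their similarity is at the top of the scale.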

In [None]:
# Calculate cosine similarity matrix
similarity_matrix = cosine_similarity(embeddings)

print(f"✅ Similarity matrix shape: {similarity_matrix.shape}")
print(f"   (Each cell shows similarity between word i and word j)")

## 🔥 Interactive Heatmap

In [None]:
# Create heatmap with Plotly for interactivity
fig = go.Figure(data=go.Heatmap(
    z=similarity_matrix,
    x=words,
    y=words,
    colorscale='RdYlBu_r',  # Red = high similarity, Blue = low
    zmin=0,
    zmax=1,
    hovertemplate='%{x} ↔ %{y}<br>Similarity: %{z:.3f}<extra></extra>'
))

fig.update_layout(
    title='üî• Cosine Similarity Heatmap',
    xaxis_title='Words',
    yaxis_title='Words',
    height=800,
    width=900,
    xaxis={'tickangle': 45},
    yaxis={'autorange': 'reversed'}  # Put first word at top
)

fig.show()

print("\n💡 TIP: Hover over cells to see exact similarity values!")

## 📈 Alternative: Seaborn Static Heatmap (with annotations)

In [None]:
# Create a static heatmap with annotations (if you prefer)
plt.figure(figsize=(16, 14))

# Create heatmap
sns.heatmap(
    similarity_matrix,
    xticklabels=words,
    yticklabels=words,
    cmap='RdYlBu_r',
    vmin=0,
    vmax=1,
    annot=len(words) <= 15,  # Show values only if not too many words
    fmt='.2f',
    square=True,
    cbar_kws={'label': 'Cosine Similarity'}
)

plt.title('Cosine Similarity Between Word Embeddings', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

## 🏆 Find Most & Least Similar Pairs

In [None]:
# Find most and least similar pairs (excluding self-comparisons)
n_words = len(words)
pairs = []

for i in range(n_words):
    for j in range(i + 1, n_words):  # Only upper triangle
        pairs.append({
            'word1': words[i],
            'word2': words[j],
            'similarity': similarity_matrix[i, j],
            'category1': categories[i],
            'category2': categories[j]
        })

pairs_df = pd.DataFrame(pairs).sort_values('similarity', ascending=False)

print("🏆 TOP 10 MOST SIMILAR PAIRS:")
print("=" * 60)
for idx, row in pairs_df.head(10).iterrows():
    same_cat = "✅" if row['category1'] == row['category2'] else "❌"
    print(f"{same_cat} {row['word1']:15} ↔ {row['word2']:15} : {row['similarity']:.4f}")

print("\n\n❄️ TOP 10 LEAST SIMILAR PAIRS:")
print("=" * 60)
for idx, row in pairs_df.tail(10).iterrows():
    same_cat = "✅" if row['category1'] == row['category2'] else "❌"
    print(f"{same_cat} {row['word1']:15} ↔ {row['word2']:15} : {row['similarity']:.4f}")
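
The nested loop in this cell visits each unordered pair of words exactly once. An equivalent way to write that traversal, sketched here on a made-up 3×3 similarity matrix (the words and numbers are illustrative only, not model output), uses `itertools.combinations` from the standard library.

```python
from itertools import combinations

toy_words = ["dog", "cat", "car"]
# Made-up symmetric similarity matrix, for illustration only
toy_sim = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]

# combinations(range(n), 2) yields every (i, j) with i < j,
# i.e. the upper triangle without the diagonal
ranked = sorted(
    (
        (toy_sim[i][j], toy_words[i], toy_words[j])
        for i, j in combinations(range(len(toy_words)), 2)
    ),
    reverse=True,  # highest similarity first
)

for score, w1, w2 in ranked:
    print(f"{w1:5} <-> {w2:5} : {score:.4f}")
```

For n words this produces n(n-1)/2 pairs, which is 435 for the 30-word list used in this notebook.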

---
## 🎓 Key Takeaways

1. **Embeddings capture meaning** - Similar words have similar vectors
2. **Dimensionality reduction** helps us visualize high-dimensional data
3. **Cosine similarity** is the standard metric for comparing embeddings
4. **Words can belong to multiple categories** (e.g., "orange")

## 🚀 Next Steps

Try modifying the `word_categories` dictionary to explore:
- Different semantic categories
- Synonyms and antonyms
- Domain-specific vocabulary

Then move on to **Notebook 2** to compare different embedding models!