# Understanding Text Embeddings: A Beginner's Guide

This notebook shows you how to turn words and sentences into numbers that computers can understand, then visualize them on a simple 2D plot.

## What Are Text Embeddings?

**Text embeddings** convert words or sentences into lists of numbers. Think of it like giving each word coordinates on a map: words with similar meanings get similar coordinates and end up close together.

**Why is this useful?**
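- Words like "dog" and "cat" tend to be close together (both are pets)
- Words like "dog" and "software" tend to be far apart (unrelated topics)
- Opposites like "hot" and "cold" often land fairly close together too, because both describe temperature and appear in similar contexts
- Comparing these positions lets programs search, group, and compare text by meaning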
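
To make the "coordinates on a map" idea concrete, here is a minimal sketch using hand-picked 2D vectors. The numbers are invented for illustration and are not produced by any model; real embeddings have hundreds of dimensions. Cosine similarity, used below, is one common way to measure how closely two vectors point in the same direction.

```python
import math

# Made-up 2D "embeddings" (assumption: illustrative values, not model output)
toy_vectors = {
    "dog": [0.9, 0.8],
    "cat": [0.8, 0.9],
    "car": [0.9, -0.6],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

dog_cat = cosine_similarity(toy_vectors["dog"], toy_vectors["cat"])
dog_car = cosine_similarity(toy_vectors["dog"], toy_vectors["car"])
print(f"dog vs cat: {dog_cat:.3f}")
print(f"dog vs car: {dog_car:.3f}")
```

A value near 1 means the two vectors point in almost the same direction; a value near 0 means they are unrelated.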

## Step 1: Install and Import What We Need

In [None]:
# Install required packages (uncomment if needed)
# !pip install sentence-transformers scikit-learn matplotlib pandas seaborn

In [None]:
# Import the tools we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

# Make our plots look nice
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 8)

## Step 2: Visualizing Word Embeddings

Let's start with simple words from different categories and see how they group together.

In [None]:
# Load a model that can convert text to numbers
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded successfully!")

In [None]:
# Choose words from different categories
words = [
    # Animals
    "dog", "cat", "horse", "lion", "elephant",
    # Fruits  
    "apple", "banana", "orange", "strawberry", "watermelon",
    # Countries
    "usa", "canada", "france", "japan", "australia", 
    # Technology
    "computer", "phone", "internet", "software", "data"
]

# Labels for our categories (same order as words above)
categories = ["Animal"]*5 + ["Fruit"]*5 + ["Country"]*5 + ["Technology"]*5

print(f"We have {len(words)} words to visualize")

In [None]:
# Convert words to numbers (embeddings)
word_embeddings = model.encode(words)
print(f"Each word is now represented by {word_embeddings.shape[1]} numbers")
print("That's too many dimensions to visualize, so we'll reduce it to 2D")
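
Before reducing dimensions, it can help to compare the vectors directly. The sketch below uses a small stand-in matrix in place of `word_embeddings` so that it runs without downloading a model; `pairwise_cosine` and `nearest_neighbor` are hypothetical helper names, not part of any library.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Cosine similarity between every pair of rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

def nearest_neighbor(similarity, labels):
    """For each label, return the most similar *other* label."""
    masked = similarity.copy()
    np.fill_diagonal(masked, -np.inf)  # ignore each item's similarity to itself
    return {labels[i]: labels[j] for i, j in enumerate(masked.argmax(axis=1))}

# Stand-in data (assumption: made-up values, not model output)
demo_labels = ["dog", "cat", "apple", "banana"]
demo_embeddings = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.0],
    [0.1, 0.0, 1.0],
    [0.0, 0.2, 0.9],
])

similarity = pairwise_cosine(demo_embeddings)
neighbors = nearest_neighbor(similarity, demo_labels)
print(neighbors)
```

With the notebook's real data, the same helpers could be called as `nearest_neighbor(pairwise_cosine(word_embeddings), words)`.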

It may be easier to explore the embeddings first on the TensorFlow Embedding Projector (https://projector.tensorflow.org/). Let's save our data so that we can upload it to this website.

In [None]:
# Save word embeddings as vectors file (TSV format)
word_vectors_df = pd.DataFrame(word_embeddings)
word_vectors_df.to_csv('word_vectors.tsv', sep='\t', header=False, index=False)
print("✓ Saved word_vectors.tsv")

# Save word labels and categories as metadata file  
word_metadata_df = pd.DataFrame({
    'Label': words,
    'Category': categories
})
word_metadata_df.to_csv('word_metadata.tsv', sep='\t', index=False)
print("✓ Saved word_metadata.tsv")
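
The Embedding Projector pairs each line of the vectors file with one metadata row, and treats the first row of a multi-column metadata file as column labels. The sketch below is a hypothetical sanity check (`check_projector_files` is not part of any library) that confirms the row counts line up before uploading. It writes two tiny demo files to a temporary directory so that it runs on its own.

```python
import csv
import os
import tempfile

def check_projector_files(vectors_path, metadata_path):
    """Return (n_vectors, n_metadata_rows); raise ValueError on a mismatch.

    Assumes the metadata file starts with a header row, as the
    two-column files saved in this notebook do.
    """
    with open(vectors_path, newline="") as f:
        vector_rows = [row for row in csv.reader(f, delimiter="\t") if row]
    with open(metadata_path, newline="") as f:
        metadata_rows = [row for row in csv.reader(f, delimiter="\t") if row]

    n_vectors = len(vector_rows)
    n_metadata = len(metadata_rows) - 1  # subtract the header row
    if n_vectors != n_metadata:
        raise ValueError(f"{n_vectors} vectors but {n_metadata} metadata rows")
    return n_vectors, n_metadata

# Demo with two made-up vectors and matching metadata
with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "demo_vectors.tsv")
    meta_path = os.path.join(tmp, "demo_metadata.tsv")
    with open(vec_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows([[0.1, 0.2], [0.3, 0.4]])
    with open(meta_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(
            [["Label", "Category"], ["dog", "Animal"], ["apple", "Fruit"]]
        )
    counts = check_projector_files(vec_path, meta_path)

print(counts)
```

For the files saved above, the call would be `check_projector_files('word_vectors.tsv', 'word_metadata.tsv')`.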

Now let's visualize the embeddings with t-SNE here in the notebook.

In [None]:
# Reduce from many dimensions to just 2 (so we can plot it)
# t-SNE tries to keep items that are close in the original space close in 2D
# perplexity must be smaller than the number of points (we have 20 words)
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
word_2d = tsne.fit_transform(word_embeddings)
print("Successfully reduced to 2D coordinates!")
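
scikit-learn's `TSNE` defaults to `perplexity=30` and requires the perplexity to be smaller than the number of samples, which is why this notebook sets small values by hand. The helper below is a hypothetical rule of thumb, not a scikit-learn function: it caps the perplexity at roughly a third of the sample count. Treat the exact cap as an assumption and adjust it while looking at the plot.

```python
def pick_perplexity(n_samples, preferred=30.0):
    """Hypothetical heuristic: cap perplexity at about (n_samples - 1) / 3."""
    if n_samples < 3:
        raise ValueError("need at least 3 samples")
    cap = (n_samples - 1) / 3
    # Never go below 1.0, never above the preferred value
    return max(1.0, min(preferred, cap))

for n in (12, 20, 1000):
    print(n, round(pick_perplexity(n), 2))
```

For 20 words this suggests a value a little above 6, in the same range as the `perplexity=5` used above.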

In [None]:
# Create a simple plot
plt.figure(figsize=(12, 8))

# Choose colors for each category
colors = {"Animal": "blue", "Fruit": "orange", "Country": "green", "Technology": "red"}

# Plot each category with its own color
for category in colors:
    # Find words that belong to this category
    mask = [cat == category for cat in categories]
    category_points = word_2d[mask]
    
    # Plot them
    plt.scatter(category_points[:, 0], category_points[:, 1], 
               c=colors[category], label=category, alpha=0.7, s=100)

# Add word labels to each point
for i, word in enumerate(words):
    plt.annotate(word, (word_2d[i, 0], word_2d[i, 1]), 
                xytext=(5, 5), textcoords='offset points', fontsize=10)

plt.title("Word Embeddings Visualization", fontsize=16)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\nWhat to notice:")
print("- Words from the same category should be close together")
print("- Similar words should be near each other")
print("- Different categories should be in different areas")
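
The "what to notice" points above can also be checked with numbers rather than by eye. Because t-SNE distorts distances, it is worth measuring similarity in the original embedding space as well. The sketch below compares the average cosine similarity within categories against the average between categories. It uses a made-up stand-in matrix so that it runs on its own, and `category_cohesion` is a hypothetical helper name.

```python
import numpy as np

def category_cohesion(embeddings, labels):
    """Return (mean within-category, mean between-category) cosine similarity."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]        # True where categories match
    off_diagonal = ~np.eye(len(labels), dtype=bool)  # skip self-pairs
    within = similarity[same & off_diagonal].mean()
    between = similarity[~same].mean()
    return float(within), float(between)

# Stand-in data (assumption: made-up values, not model output)
demo_embeddings = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.0],
    [0.1, 0.0, 1.0],
    [0.0, 0.2, 0.9],
])
demo_categories = ["Animal", "Animal", "Fruit", "Fruit"]

within, between = category_cohesion(demo_embeddings, demo_categories)
print(f"within: {within:.3f}, between: {between:.3f}")
```

With the notebook's data the call would be `category_cohesion(word_embeddings, categories)`; a within-category average clearly above the between-category average supports what the plot suggests.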

## Step 3: Visualizing Sentence Embeddings

Now let's try with full sentences instead of just words.

In [None]:
# Create sentences about different topics
sentences = [
    # Technology
    "I love using my smartphone every day",
    "Computers make our work much easier", 
    "The internet connects people worldwide",
    
    # Food
    "Pizza is my favorite dinner",
    "Fresh fruit tastes amazing in summer",
    "Cooking at home saves money",
    
    # Sports  
    "Playing soccer keeps me healthy",
    "Swimming is great exercise for your whole body",
    "I enjoy watching basketball games",
    
    # Weather
    "It's raining heavily outside today",
    "Sunny days make me feel happy", 
    "Winter snow is beautiful but cold"
]

# Labels for sentence topics
sentence_topics = ["Technology"]*3 + ["Food"]*3 + ["Sports"]*3 + ["Weather"]*3

print(f"We have {len(sentences)} sentences to visualize")

In [None]:
# Convert sentences to numbers
sentence_embeddings = model.encode(sentences)
print(f"Each sentence is now represented by {sentence_embeddings.shape[1]} numbers")

In [None]:
# Save sentence embeddings as vectors file
sentence_vectors_df = pd.DataFrame(sentence_embeddings)
sentence_vectors_df.to_csv('sentence_vectors.tsv', sep='\t', header=False, index=False)
print("✓ Saved sentence_vectors.tsv")

# Save sentence labels and categories as metadata file
sentence_labels = [f"S{i+1}: {sent}" for i, sent in enumerate(sentences)]
sentence_metadata_df = pd.DataFrame({
    'Label': sentence_labels,
    'Category': sentence_topics
})
sentence_metadata_df.to_csv('sentence_metadata.tsv', sep='\t', index=False)
print("✓ Saved sentence_metadata.tsv")

In [None]:
# Reduce to 2D for plotting
tsne = TSNE(n_components=2, perplexity=4, random_state=42)
sentence_2d = tsne.fit_transform(sentence_embeddings)
print("Reduced sentences to 2D coordinates!")

In [None]:
# Plot the sentences
plt.figure(figsize=(12, 10))

# Colors for each topic
topic_colors = {"Technology": "blue", "Food": "orange", "Sports": "green", "Weather": "purple"}

# Plot each topic with its own color
for topic in topic_colors:
    mask = [t == topic for t in sentence_topics]
    topic_points = sentence_2d[mask]
    
    plt.scatter(topic_points[:, 0], topic_points[:, 1], 
               c=topic_colors[topic], label=topic, alpha=0.7, s=120)

# Add numbers to each point
for i in range(len(sentences)):
    plt.annotate(str(i+1), (sentence_2d[i, 0], sentence_2d[i, 1]), 
                xytext=(0, 0), textcoords='offset points', 
                fontsize=12, fontweight='bold', ha='center', va='center')

plt.title("Sentence Embeddings Visualization", fontsize=16)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Show which number corresponds to which sentence
print("\nSentence Reference:")
for i, sentence in enumerate(sentences):
    print(f"{i+1}. {sentence}")

print("\nWhat to notice:")
print("- Sentences about the same topic should cluster together")
print("- Similar sentences should be close to each other")
print("- Sentences can cluster by topic even when they share few or no words")
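
To see why meaning-based embeddings matter, compare them with a plain word-overlap measure. The sketch below computes Jaccard similarity on three of the sentences above. It is a baseline for contrast, not what the embedding model does, and `word_overlap` is a hypothetical helper name.

```python
def word_overlap(a, b):
    """Jaccard similarity between the lowercase word sets of two sentences."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

soccer = "Playing soccer keeps me healthy"
swimming = "Swimming is great exercise for your whole body"
sunny = "Sunny days make me feel happy"

same_topic = word_overlap(soccer, swimming)    # both about sports
different_topic = word_overlap(soccer, sunny)  # sports vs weather
print(f"soccer vs swimming: {same_topic:.2f}")
print(f"soccer vs sunny:    {different_topic:.2f}")
```

Word overlap scores the two sports sentences at zero because they share no words, while the soccer and weather sentences score higher only because both contain "me". You can compare this with the cosine similarity of the matching rows of `sentence_embeddings` to see whether the model ranks them differently.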

## Step 4: Try Your Own Words or Sentences!

Now you can experiment with your own text:

In [None]:
# Add your own words here!
my_words = [
    "happy", "sad", "angry", "excited",  # emotions
    "red", "blue", "green", "yellow",    # colors
    "big", "small", "huge", "tiny"       # sizes
]

my_categories = ["Emotion"]*4 + ["Color"]*4 + ["Size"]*4

# Convert to embeddings and reduce to 2D
# (if you change the word list, keep perplexity below the number of words)
my_embeddings = model.encode(my_words)
tsne = TSNE(n_components=2, perplexity=3, random_state=42)
my_2d = tsne.fit_transform(my_embeddings)

# Simple plot
plt.figure(figsize=(10, 8))
colors = {"Emotion": "red", "Color": "blue", "Size": "green"}

for category in colors:
    mask = [cat == category for cat in my_categories]
    points = my_2d[mask]
    plt.scatter(points[:, 0], points[:, 1], c=colors[category], label=category, s=100)

for i, word in enumerate(my_words):
    plt.annotate(word, (my_2d[i, 0], my_2d[i, 1]), 
                xytext=(5, 5), textcoords='offset points')

plt.title("My Custom Word Embeddings")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## What We Learned

1. **Text embeddings** turn words and sentences into numbers
2. **Similar meanings** result in similar numbers  
3. **t-SNE** helps us visualize high-dimensional data in 2D
4. **Clustering** in the plot shows us which concepts the model treats as related

### Try This Next:
- Change the words in the examples above
- Add more categories 
- Try sentences in different languages
- See what happens with very similar sentences