# Understanding Embeddings


##
---
---

## 📚 Introduction

Welcome to this comprehensive guide on **embeddings** - one of the fundamental concepts in modern natural language processing and machine learning.

### What Are Embeddings?

Embeddings are high-dimensional **vector representations** of text: they transform words, sentences, or documents into numerical arrays that capture semantic meaning.

**Key Properties:**
- Similar meanings -> closer vectors in space
- Different meanings -> vectors farther apart
- Used in: semantic search, clustering, classification, and RAG systems
<br><br>

##
---
---

## 🎨 A Simple 2D World

To understand embeddings intuitively, let's imagine a simplified **2-dimensional world** with two axes:

- **X-axis = Fruit-ness** (how much the word relates to fruits)
- **Y-axis = Sweetness** (how sweet the item is)

### Example Mappings

| Word | Coordinates | Quadrant |
|------|-------------|----------|
| 🍎 apple | (0.9, 0.7) | 1st |
| 🍌 banana | (0.5, 0.8) | 1st |
| 💻 operating system | (-0.2, -0.4) | 3rd |

Both **apple** and **banana** lie in the first quadrant because they're fruits with sweetness. The **operating system** is in a completely different region since it's neither a fruit nor sweet.

![Embedding Space](../../data/photos/embedding_space.png)

> **Note:** While this example uses 2 dimensions for visualization, real-world embedding models typically operate in much higher dimensions (384-1024+) to capture complex relationships.

##
---
---

## 📐 Measuring Similarity: The Distance Problem

Now that we've introduced embeddings, let's talk about how we use them in practice. Embeddings have a wide range of applications: clustering, classification, semantic search, and more. From a RAG perspective, embeddings help us retrieve information that's similar to or shares meaning with the user's query.

But how exactly do we measure how similar two pieces of text are using embeddings? Let's try out a few approaches.

---

### First Approach: Quadrant-Based

**Intuition:** If two points are in the same quadrant, they must be similar, right? <br><br>
Apple and banana both lie in the first quadrant, so they look similar. The operating system lies in the opposite quadrant, so it looks unrelated.

**Problem:** This gives only a yes/no answer. Every pair of points within a quadrant looks equally similar, so we can't tell *how* similar two items are or rank one match above another.
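To see why the quadrant test is too coarse, here is a minimal sketch using the toy coordinates from the table above. The `quadrant` and `same_quadrant` helpers are hypothetical names introduced for illustration, and points lying exactly on an axis are ignored for simplicity.

```python
# Toy 2D coordinates from the table above: (fruit-ness, sweetness).
points = {
    "apple": (0.9, 0.7),
    "banana": (0.5, 0.8),
    "operating system": (-0.2, -0.4),
}


def quadrant(point):
    """Return the quadrant number (1-4) of a 2D point.

    Hypothetical helper for illustration; assumes the point is not on an axis.
    """
    x, y = point
    if x > 0 and y > 0:
        return 1
    if x < 0 and y > 0:
        return 2
    if x < 0 and y < 0:
        return 3
    return 4


def same_quadrant(a, b):
    """The quadrant 'similarity' is just a yes/no answer."""
    return quadrant(a) == quadrant(b)


print(same_quadrant(points["apple"], points["banana"]))            # True
print(same_quadrant(points["apple"], points["operating system"]))  # False
```

The result is a boolean, so there is no score to sort by: any two words that share a quadrant come out as equally "similar".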


###
---

### Second Approach: Euclidean Distance


**Intuition:** Let's measure the distance between the points. <br>
Distance feels natural:
- If two points are close -> similar
- If two points are far -> different

The most common way to calculate distance is the **Euclidean distance formula**. For two points $\vec{a} = (x_1, y_1)$ and $\vec{b} = (x_2, y_2)$:

$$\text{Distance}(\vec{a}, \vec{b}) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

Let's calculate:
- Distance(apple, banana) ≈ 0.41 (close) ✅
- Distance(apple, operating system) ≈ 1.56 (far) ✅
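If you'd like to check these numbers yourself, the formula can be sketched in a few lines of plain Python. The `euclidean_distance` name is our own choice for this illustration, and the function is written so it also works for vectors with more than two dimensions.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Toy coordinates from the table above: (fruit-ness, sweetness).
apple = (0.9, 0.7)
banana = (0.5, 0.8)
operating_system = (-0.2, -0.4)

print(f"apple vs banana: {euclidean_distance(apple, banana):.2f}")                      # 0.41
print(f"apple vs operating system: {euclidean_distance(apple, operating_system):.2f}")  # 1.56
```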

This works well initially...

###
---

### The Problem with Distance

Consider adding a **green apple** at coordinates (0.45, 0.35).

Now let's calculate distances:
- Distance(apple, banana) ≈ 0.41
- Distance(apple, green apple) ≈ 0.57 ❌

**Problem:** According to Euclidean distance, an apple is more similar to a banana than to a green apple! This doesn't make semantic sense.

**Why?** The green apple's vector is exactly half of apple's: (0.45, 0.35) = 0.5 × (0.9, 0.7). It points in the same *direction* as apple, just with smaller magnitude. Distance alone doesn't capture this directional similarity.
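A quick sketch makes the mismatch visible. It uses only Python's standard `math` module on the toy coordinates, and compares distance, direction (the angle from the X-axis), and length side by side.

```python
import math

apple = (0.9, 0.7)
banana = (0.5, 0.8)
green_apple = (0.45, 0.35)

# math.dist computes the Euclidean distance between two points (Python 3.8+).
print(f"apple vs banana: {math.dist(apple, banana):.2f}")            # 0.41
print(f"apple vs green apple: {math.dist(apple, green_apple):.2f}")  # 0.57

# Direction of each vector, measured as an angle from the X-axis.
apple_angle = math.degrees(math.atan2(apple[1], apple[0]))
green_apple_angle = math.degrees(math.atan2(green_apple[1], green_apple[0]))
print(f"apple direction: {apple_angle:.1f} degrees")              # 37.9
print(f"green apple direction: {green_apple_angle:.1f} degrees")  # 37.9

# Length (magnitude) of each vector.
print(f"apple length: {math.hypot(*apple):.2f}")              # 1.14
print(f"green apple length: {math.hypot(*green_apple):.2f}")  # 0.57
```

The two apples share a direction and differ only in length, yet Euclidean distance ranks banana as the closer neighbor.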

![Green Apple Problem](../../data/photos/embedding_space_with_green_apple.png)


###
---

### The Solution: Cosine Similarity

**Cosine similarity** measures the angle between two vectors. In other words, it compares their *direction* and ignores their length:

$$\text{cosine\_similarity}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \times ||\vec{b}||}$$

**Range:** -1 to 1
- 1 = identical direction (very similar)
- 0 = perpendicular (unrelated)
- -1 = opposite direction (opposite meaning)
<br>
<br>
 
Let us now calculate the cosine similarities:
- **Apple & green apple**: cos(0°) = 1 (the vectors point in exactly the same direction, indicating extremely high similarity) ✅
- **Apple & banana**: cos(20.1°) ≈ 0.94 (lower than apple & green apple, but still reflecting substantial similarity) ✅
- **Apple & operating system**: cos(205.6°) ≈ -0.90 (the vectors point in nearly opposite directions, indicating strong dissimilarity) ✅

The 205.6° is measured counterclockwise from apple to operating system. The angle between the two vectors is 360° - 205.6° = 154.4°, which has the same cosine.
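The same values can be reproduced directly from the formula. This is a minimal plain-Python sketch on the toy 2D coordinates; it assumes neither vector is all zeros, since that would mean dividing by zero.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


apple = (0.9, 0.7)
banana = (0.5, 0.8)
green_apple = (0.45, 0.35)
operating_system = (-0.2, -0.4)

for name, other in [
    ("green apple", green_apple),
    ("banana", banana),
    ("operating system", operating_system),
]:
    print(f"apple vs {name}: {cosine_similarity(apple, other):.2f}")
# apple vs green apple: 1.00
# apple vs banana: 0.94
# apple vs operating system: -0.90
```

Unlike with Euclidean distance, green apple now ranks above banana, because shrinking a vector doesn't change its direction.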


This captures the *semantic direction* rather than just magnitude!

![Embedding Space with Angles](../../data/photos/embedding_space_with_angles.png)



##
---
---

## 🧪 Implementation: Let’s Apply What We Learned!

Now that we understand what embeddings are and how they help us measure the similarity between words or sentences, let’s implement it step by step.

In this assignment, you will:
1. Choose a few words
2. Generate embeddings for each using a pre-trained model
3. Compute cosine similarity between the embeddings
4. Interpret the results

---

#### Setup and Initialization

In [None]:
# Import required libraries
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the embedding model
MODEL_NAME = "all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

print(f"Model loaded successfully: {MODEL_NAME}")
print(f"Model dimension: {model.get_sentence_embedding_dimension()}")

####
---

#### Task 1 - Pick Your Words

Choose at least three words:
- Two words that you think are similar
- One word that is very different

In [None]:
# Example (uncomment this line, or define your own list named `words`):
# words = ["apple", "banana", "green apple", "syrup"]

### 
---

#### Task 2 - Generate Embeddings


Use any embedding model (e.g., sentence-transformers) to generate dense vector representations.

In [None]:
# We can generate embeddings in bulk by using the model.encode() method
embeddings = model.encode(words)

# Let's create a mapping of words to their embeddings
embedding_mapping = dict(zip(words, embeddings))


When we run model.encode on a list, it gives us a NumPy array of floats with one row per input. Basically, each word is turned into a list of numbers, and the length of that list is 384: that's the 'dimension' of our embedding space.
So, with this model, no matter what word or sentence we give, it's always mapped to a float array of length 384.

### 
---

#### Task 3 - Calculate Cosine Similarity

Use the formula discussed earlier or a library function to compute similarities:

In [None]:
# Complete the cosine_similarity function. It should take in two arrays and return the cosine similarity between them.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    raise NotImplementedError("You need to implement this function")
    # return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


In [None]:
# Now we will calculate the cosine similarity between all the words in our embedding_mapping dictionary.
# We will store the results in a dictionary called similarity_mapping.
similarity_mapping = {}
for word1 in embedding_mapping:
    for word2 in embedding_mapping:
        similarity_mapping[(word1, word2)] = cosine_similarity(embedding_mapping[word1], embedding_mapping[word2])


### 
---

#### Task 4 - Analyze the Results


Answer the following:
- Which pair has the highest similarity score?
- Does the result match your intuition?
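One possible way to answer the first question in code is sketched below. The scores in `toy_similarity_mapping` are made up for illustration and are not real model output; in the notebook you would run the same logic on the `similarity_mapping` dictionary built in Task 3.

```python
# Made-up scores for illustration only; real values come from Task 3.
toy_similarity_mapping = {
    ("apple", "apple"): 1.0,
    ("apple", "banana"): 0.6,
    ("apple", "syrup"): 0.3,
    ("banana", "apple"): 0.6,
    ("banana", "banana"): 1.0,
    ("banana", "syrup"): 0.4,
    ("syrup", "apple"): 0.3,
    ("syrup", "banana"): 0.4,
    ("syrup", "syrup"): 1.0,
}

# Skip self-pairs (a word compared with itself) and keep each
# unordered pair only once by requiring alphabetical order.
unique_pairs = {
    pair: score
    for pair, score in toy_similarity_mapping.items()
    if pair[0] < pair[1]
}

best_pair = max(unique_pairs, key=unique_pairs.get)
print(best_pair, unique_pairs[best_pair])  # ('apple', 'banana') 0.6
```

Filtering out self-pairs matters here: a word compared with itself always scores about 1.0 and would otherwise win every time.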

In [None]:
# Visualizing the cosine similarity matrix using a heatmap.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Extract the unique words to form the axes
words = sorted(set([key[0] for key in similarity_mapping.keys()]))

# Build the square similarity matrix
similarity_matrix = np.array([
    [similarity_mapping[(w1, w2)] for w2 in words]
    for w1 in words
], dtype=np.float32)

# Create custom colormap for better visual appeal
colors = ['#2d5aa6', '#5d8bc7', '#a8c5e8', '#f7f7f7', '#f4a582', '#d6604d', '#b2182b']
n_bins = 100
cmap = LinearSegmentedColormap.from_list('custom', colors, N=n_bins)

# Create figure with modern styling
fig, ax = plt.subplots(figsize=(10, 9), facecolor='white')
fig.patch.set_facecolor('white')

# Plot the heatmap
im = ax.imshow(similarity_matrix, cmap=cmap, vmin=0, vmax=1, aspect='auto', interpolation='nearest')

# Customize ticks and labels
ax.set_xticks(np.arange(len(words)))
ax.set_yticks(np.arange(len(words)))
ax.set_xticklabels(words, fontsize=11, fontweight='500', rotation=45, ha='right')
ax.set_yticklabels(words, fontsize=11, fontweight='500')

# Title styling
ax.set_title("Cosine Similarity Matrix", fontsize=18, fontweight='bold', pad=20, color='#2c3e50')

# Add grid for better readability
ax.set_xticks(np.arange(len(words)) - 0.5, minor=True)
ax.set_yticks(np.arange(len(words)) - 0.5, minor=True)
ax.grid(which="minor", color="white", linestyle='-', linewidth=2)
ax.tick_params(which="minor", size=0)

# Add annotations with dynamic text color based on background
for i in range(len(words)):
    for j in range(len(words)):
        value = similarity_matrix[i, j]
        # Choose text color based on background intensity
        text_color = 'white' if value > 0.6 or value < 0.3 else 'black'
        ax.text(j, i, f"{value:.2f}", 
                ha="center", va="center", 
                color=text_color, 
                fontsize=10,
                fontweight='600')

# Enhanced colorbar
cbar = fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
cbar.set_label('Cosine Similarity', fontsize=12, fontweight='600', color='#2c3e50')
cbar.ax.tick_params(labelsize=10)

# Remove top and right spines for cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_color('#bdc3c7')
ax.spines['left'].set_color('#bdc3c7')

plt.tight_layout()
plt.show()

### 
---

#### Task 5 - Complete the Embedding Function


Complete the function below that:
- Takes a list of strings
- Generates embeddings
- Returns them in the same order



In [None]:
def get_embeddings(text_list):
    """
    Given a list of strings, return their embeddings
    in the exact same order as the input.
    """
    
    # TODO:
    # 1. Encode the text_list using the model
    # 2. Return the embeddings
    pass


📌 Additional Step - Update the embedding_service.py File

Once your get_embeddings function is complete and tested:<br>
**Add or update the same function inside** ``embedding_service.py``

##
---
---

## ⚠️ Limitations of Embeddings (Optional)

While embeddings are powerful tools for capturing semantic similarity, they have important limitations that we need to understand.


### A Critical Problem with Semantic Similarity

One significant limitation is how embeddings handle sentences with **opposite meanings**. Let's explore this with a concrete example.

Consider these two sentences:
- "Apple is a fruit"
- "Apple is not a fruit"

These sentences have **opposite meanings**, yet when we calculate their embeddings and measure similarity, we get surprisingly high similarity scores. Let's see this in action.

In [None]:
# Import required libraries
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the embedding model
MODEL_NAME = "all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

print(f"Model loaded successfully: {MODEL_NAME}")

In [None]:
# Define a cosine similarity function
def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))


In [None]:
# The Negation Problem: Demonstrating a key limitation of embeddings

# Two sentences with opposite meanings
sentence1 = "Apple is a fruit"
sentence2 = "Apple is not a fruit"

# Generate embeddings
embedding1 = model.encode(sentence1)
embedding2 = model.encode(sentence2)

# Calculate similarity
similarity = cosine_similarity(embedding1, embedding2)

print("=" * 60)
print("The Negation Problem")
print("=" * 60)
print(f"\nSentence 1: '{sentence1}'")
print(f"Sentence 2: '{sentence2}'")
print(f"\nCosine Similarity: {similarity:.4f}")
print("\n" + "-" * 60)
print("Analysis:")
print("-" * 60)
print(f"Despite having OPPOSITE meanings, these sentences have")
print(f"a similarity score of {similarity:.4f}!")
print(f"\nWhy? Because they share most of the same words:")
print(f"  - Both contain: 'apple', 'is', 'a', 'fruit'")
print(f"  - Only difference: the word 'not'")
print(f"\nEmbeddings capture word overlap more than logical meaning.")
print("=" * 60)

####
---

### Why Does This Happen?

Embeddings are based on the **distributional hypothesis**: words that appear in similar contexts tend to have similar meanings. When we embed sentences, the model primarily focuses on:

1. **Word Overlap**: Both sentences share most of their words
2. **Contextual Patterns**: The sentence structures are nearly identical
3. **Token-Level Similarities**: The presence of "apple", "fruit", etc. dominates

The single word "not" doesn't change the embedding enough to reflect the complete reversal of meaning.
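The word-overlap point can be made concrete with a plain token comparison. This sketch does not reproduce what the embedding model computes internally; it only counts shared lowercase words (Jaccard overlap, a measure chosen here purely for illustration) to show how little of the surface text changes between the two sentences. The `token_set` and `jaccard_overlap` helpers are hypothetical names.

```python
def token_set(sentence):
    """Lowercase the sentence and split on whitespace (a deliberately naive tokenizer)."""
    return set(sentence.lower().split())


def jaccard_overlap(sentence_a, sentence_b):
    """Shared tokens divided by all distinct tokens across both sentences."""
    tokens_a, tokens_b = token_set(sentence_a), token_set(sentence_b)
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


sentence1 = "Apple is a fruit"
sentence2 = "Apple is not a fruit"

print(sorted(token_set(sentence2) - token_set(sentence1)))  # ['not']
print(jaccard_overlap(sentence1, sentence2))                # 0.8
```

Four of the five distinct words are shared, so a representation that is driven largely by which words are present will see the two sentences as close neighbors.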


####
---

### Implications

This limitation means embeddings may struggle with:
- **Negations**: "good" vs "not good"
- **Contradictions**: "X is Y" vs "X is not Y"  
- **Subtle Semantic Differences**: Where a small word change drastically alters meaning



####
---

### When Does This Matter?

This limitation is particularly important in:
- **Question Answering**: "Is X true?" vs "Is X false?"
- **Fact Verification**: Checking if statements contradict each other
- **Sentiment Analysis**: "I like this" vs "I don't like this"

For these use cases, more sophisticated approaches (like cross-encoders or fine-tuned models) may be needed.

##
---
---

## 🎓 Key Takeaways

### What We Learned:


1. **Embeddings Transform Words into Vectors**
   - Semantic meaning is captured in numerical form
   - Similar concepts cluster together in embedding space

2. **Cosine Similarity > Euclidean Distance**
   - Direction matters more than distance
   - Captures semantic relationships more accurately


####
---


### Applications:

- 🔍 **Semantic Search**: Find documents by meaning, not just keywords
- 🗂️ **Clustering**: Group similar items together
- 🏷️ **Classification**: Categorize text automatically
- 🤖 **RAG Systems**: Retrieve relevant context for LLMs
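To tie the applications back to cosine similarity, here is a rough sketch of the ranking step behind semantic search and RAG retrieval. It reuses the toy 2D vectors from the fruit-ness / sweetness example, and the query vector is invented for illustration. In a real system, both the query and the documents would be embedded by the same model. The `rank_by_similarity` helper is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank_by_similarity(query_vector, documents, top_k=2):
    """Return the top_k (name, score) pairs, most similar first.

    Hypothetical helper: `documents` maps a name to its embedding vector.
    """
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in documents.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 2D "embeddings" reused from the fruit-ness / sweetness example.
documents = {
    "apple": (0.9, 0.7),
    "banana": (0.5, 0.8),
    "operating system": (-0.2, -0.4),
}

# Invented query vector standing in for something like "sweet fruit".
query_vector = (0.8, 0.8)

for name, score in rank_by_similarity(query_vector, documents):
    print(f"{name}: {score:.2f}")
# apple: 0.99
# banana: 0.97
```

The retrieved items are simply the ones whose vectors point most nearly in the same direction as the query.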


##
---
---