# Module 3: Embeddings Deep Dive

**Level:** Intermediate  
**Prerequisites:** Modules 1 & 2 completed

---

## Learning Objectives

By the end of this module, you will be able to:

- Understand what embeddings are and how they work
- Generate embeddings using different models
- Calculate similarity between embeddings
- Choose the right embedding model for your use case
- Implement semantic search with embeddings
- Understand embedding dimensions and trade-offs

---

# 1. What Are Embeddings?

## 1.1 The Core Concept

**Simple Definition:**
> Embeddings are numerical representations of text that capture meaning.

**Think of it like this:**
- Words are converted to numbers (vectors)
- Similar meanings → Similar numbers
- Computers can then do math with meaning!

**Example:**
```
"cat"     → [0.2, -0.5, 0.8, 0.1, ...] (384 numbers)
"kitten"  → [0.3, -0.4, 0.7, 0.2, ...] (similar numbers!)
"car"     → [-0.1, 0.6, -0.3, 0.9, ...] (very different numbers)
```
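
To make "similar numbers" concrete, here is a minimal sketch that checks how closely two vectors point in the same direction, using cosine similarity (covered in Section 3). It uses only the first four illustrative values from the example above, not real model output, and the helper name `cosine` is an assumption made for this sketch.

```python
import math

# Toy 4-dimensional vectors: the first four illustrative values from the
# example above (not real model output).
cat = [0.2, -0.5, 0.8, 0.1]
kitten = [0.3, -0.4, 0.7, 0.2]
car = [-0.1, 0.6, -0.3, 0.9]

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_kitten = cosine(cat, kitten)
cat_car = cosine(cat, car)
print(f"cat vs kitten: {cat_kitten:.2f}")
print(f"cat vs car:    {cat_car:.2f}")
```

With these toy values the cat/kitten pair scores close to 1, while the cat/car pair comes out negative. Real embeddings behave the same way in principle, just across hundreds of dimensions.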

## 1.2 Why Do We Need Embeddings?

**Problem:** Computers process text as raw characters, not as meaning

**Traditional Approach (falls short for RAG on its own):**
- Keyword matching: "cat" only matches "cat"
- Misses "kitten", "feline", "kitty"
- Can't understand meaning or context
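
The limitation above can be sketched in a few lines. This is a deliberately naive exact-word matcher, not a real search engine; the function name `keyword_search` and the sample documents are made up for illustration.

```python
def keyword_search(query, documents):
    """Return the documents that contain the query as an exact word."""
    query = query.lower()
    return [doc for doc in documents if query in doc.lower().split()]

# Hypothetical mini-corpus: all three documents are about cats.
documents = [
    "The cat sat on the mat",
    "A kitten played with the yarn",
    "My feline friend sleeps all day",
]

matches = keyword_search("cat", documents)
print(matches)
```

Only the first document is returned. The "kitten" and "feline" documents are missed even though they are about the same thing.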

**Embeddings Approach (what RAG uses):**
- Captures semantic meaning
- "cat" is similar to "kitten", "feline", "pet"
- Finds relevant content even with different words

**This is why embeddings enable semantic search!**

---

# 2. Generating Your First Embeddings

## 2.1 Setup

We'll use **sentence-transformers** - a popular, free, open-source library.

In [1]:
# Install required library
!pip install -q sentence-transformers

In [2]:
from sentence_transformers import SentenceTransformer
import numpy as np

print("✅ Libraries imported successfully!")



✅ Libraries imported successfully!


## 2.2 Load an Embedding Model

In [3]:
# Load a small, fast embedding model
print("Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model loaded!")
print(f"Model produces {model.get_sentence_embedding_dimension()} dimensional embeddings")

Loading embedding model...



✅ Model loaded!
Model produces 384 dimensional embeddings


## 2.3 Generate Embeddings

In [4]:
# Simple example
text = "The cat sat on the mat"

# Generate embedding
embedding = model.encode(text)

print(f"Original text: {text}")
print(f"Embedding shape: {embedding.shape}")
print(f"Embedding type: {type(embedding)}")
print(f"\nFirst 10 values: {embedding[:10]}")

Original text: The cat sat on the mat
Embedding shape: (384,)
Embedding type: <class 'numpy.ndarray'>

First 10 values: [ 0.13040186 -0.01187012 -0.02811704  0.05123863 -0.05597441  0.03019154
  0.03016129  0.02469839 -0.01837056  0.05876678]


### 💡 What just happened?

- Input: 6 words
- Output: 384 numbers (a vector)
- These numbers capture the meaning of the sentence
- We can now do math with this meaning!

---

# 3. Similarity: The Heart of RAG

## 3.1 Cosine Similarity Explained

**Cosine similarity** measures how similar two vectors are.

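**Range:** -1 to 1. As a rough guide (exact thresholds vary by model and data):
- **1.0** = Same direction (in practice, identical or near-identical text)
- **0.8-0.9** = Very similar
- **0.5-0.7** = Somewhat related
- **< 0.3** = Not related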

**Formula:** Don't worry about the math - just use the function!
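
For readers who do want the formula: cosine similarity is the dot product of the two vectors divided by the product of their lengths. Here $\mathbf{a}$ and $\mathbf{b}$ stand for the two embedding vectors, $n$ for the number of dimensions, and $\theta$ for the angle between them (notation chosen for this note).

$$
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \; \sqrt{\sum_{i=1}^{n} b_i^2}}
$$

The numerator is what `np.dot` computes and each factor in the denominator is what `np.linalg.norm` computes in the `cosine_similarity` function in this section.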

In [5]:
def cosine_similarity(vec1, vec2):
    """
    Calculate cosine similarity between two vectors.
    
    Returns a score between -1 and 1 (higher = more similar)
    """
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

print("✅ Similarity function ready!")

✅ Similarity function ready!
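
A few sanity checks help build intuition for the -1 to 1 range. The sketch below re-implements the same calculation in plain Python so it runs without NumPy or a model. The helper name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions for this sketch; the NumPy version above would divide by zero in that case.

```python
import math

def cosine(a, b):
    """Pure-Python cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)

same = cosine([1.0, 2.0], [1.0, 2.0])         # same direction
scaled = cosine([1.0, 2.0], [10.0, 20.0])     # same direction, different length
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])   # perpendicular
opposite = cosine([1.0, 2.0], [-1.0, -2.0])   # opposite direction
zero = cosine([0.0, 0.0], [1.0, 2.0])         # zero-vector edge case

for name, value in [("same", same), ("scaled", scaled),
                    ("orthogonal", orthogonal), ("opposite", opposite),
                    ("zero", zero)]:
    print(f"{name:>10}: {value:.3f}")
```

Note that `scaled` matches `same`: cosine similarity depends only on direction, not on vector length.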


## 3.2 Testing Similarity

In [6]:
# Create test sentences
sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",      # Similar meaning, different words
    "Dogs are loyal animals",          # Different topic
    "Python is a programming language" # Completely unrelated
]

# Generate embeddings for all sentences
embeddings = model.encode(sentences)

# Compare the first sentence to every sentence (including itself)
print("Comparing to: 'The cat sat on the mat'\n")
for i, sentence in enumerate(sentences):
    similarity = cosine_similarity(embeddings[0], embeddings[i])
    print(f"Similarity to '{sentence}'")
    print(f"Score: {similarity:.3f}\n")

Comparing to: 'The cat sat on the mat'

Similarity to 'The cat sat on the mat'
Score: 1.000

Similarity to 'A feline rested on the rug'
Score: 0.564

Similarity to 'Dogs are loyal animals'
Score: 0.165

Similarity to 'Python is a programming language'
Score: 0.031



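### 💡 Observations

Notice how:
- ✅ "A feline rested on the rug" scores highest among the other sentences (0.564) despite using different content words (same meaning, different words)
- ✅ "Dogs are loyal animals" scores much lower (0.165): it is also about animals, but the context is different
- ✅ "Python is a programming language" scores near zero (0.031): completely unrelated

The absolute scores are modest, but the ranking follows the meaning, and ranking is what semantic search relies on.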

**This is semantic search in action!**
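
Turning this comparison into a search step means scoring every document against the query and keeping the best matches. The following is a minimal sketch under stated assumptions: the function name `semantic_search`, the `top_k` parameter, and the 3-dimensional vectors are all hypothetical stand-ins. In the notebook, the vectors would come from `model.encode(...)` instead.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, doc_vecs, documents, top_k=2):
    """Return the top_k (score, document) pairs, best match first."""
    scored = [(cosine(query_vec, vec), doc)
              for vec, doc in zip(doc_vecs, documents)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional vectors standing in for real embeddings.
documents = ["cats and kittens", "dog training tips", "Python tutorials"]
doc_vecs = [[0.9, 0.1, 0.0], [0.6, 0.7, 0.1], [0.0, 0.1, 0.9]]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedding of a cat-related query

results = semantic_search(query_vec, doc_vecs, documents, top_k=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Sorting every score is fine for a handful of documents. Larger collections typically use a vector index, which is outside the scope of this sketch.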