# Understanding Embedding Models for RAG

## What are Embedding Models?

An **embedding model** is a neural network that converts text (words, sentences, or documents) into dense numerical vectors (embeddings). These vectors capture the semantic meaning of the text in a high-dimensional space.

### Why Do We Need Embeddings?

- **Computers don't understand text naturally** - they work with numbers
- **Traditional keyword matching is limited** - it misses semantic relationships
- **Embeddings capture meaning** - similar concepts have similar vector representations
- **Enable semantic search** - find documents by meaning, not just exact word matches
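
The limits of keyword matching can be seen without any model at all. The sketch below scores sentence pairs by word overlap (Jaccard similarity); the helper name `keyword_overlap` and the example sentences are illustrative choices, not part of any library.

```python
def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


query = "The cat is sleeping on the couch"
paraphrase = "A feline is resting on the sofa"    # same meaning, different words
lookalike = "The dog is barking on the couch"     # different meaning, shared words

print(f"query vs paraphrase: {keyword_overlap(query, paraphrase):.2f}")
print(f"query vs lookalike:  {keyword_overlap(query, lookalike):.2f}")
```

Word overlap ranks the lookalike sentence above the paraphrase, which is the opposite of what a semantic search should do. An embedding model is expected to reverse that ordering.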

## How Embedding Models Work

```
Input Text: "The cat sat on the mat"
           ↓ (Embedding Model)
Output Vector: [0.2, -0.5, 0.8, 0.1, ..., -0.3]
              (commonly 384, 768, or 1536 dimensions, depending on the model)
```

### Key Properties of Good Embeddings:

1. **Semantic Similarity**: Similar meanings → similar vectors
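2. **Dimensionality**: Fixed per model, commonly a few hundred to a few thousand dimensions
3. **Dense Representation**: Most values are non-zero, and meaning is spread across all dimensions rather than tied to any single one
4. **Cosine Similarity**: The usual measure of how "close" two vectors are (1 = same direction, 0 = unrelated)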
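
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch, using made-up toy vectors rather than real embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Because the result depends only on direction, scaling a vector does not change its similarity to other vectors.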

## Popular Embedding Models

### 1. **Sentence Transformers** (Most Common for RAG)
```python
from sentence_transformers import SentenceTransformer

# Popular models (pick one; each assignment below replaces the previous one):
model = SentenceTransformer('all-MiniLM-L6-v2')      # Fast, 384 dim
model = SentenceTransformer('all-mpnet-base-v2')     # Better quality, 768 dim
model = SentenceTransformer('all-distilroberta-v1')  # Good balance, 768 dim
```

**Pros**: Purpose-built for sentence embeddings, great performance
**Cons**: Requires separate installation

### 2. **OpenAI Embeddings**
```python
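from openai import OpenAI  # openai>=1.0; the older openai.Embedding.create was removed

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-ada-002",  # 1536 dimensions
    input="Your text here",
)
embedding = response.data[0].embedding  # list of floats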
```

**Pros**: High quality, maintained by OpenAI
**Cons**: Requires API key, costs money, internet connection

### 3. **Hugging Face Models**
```python
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```

**Pros**: Many options, free, can run locally
**Cons**: More complex setup
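
Part of the extra setup is that `AutoModel` returns one vector per token, not one per text, so a pooling step is still needed. The model card for `all-MiniLM-L6-v2` uses mean pooling over non-padding tokens. The sketch below shows that step on plain Python lists with made-up numbers; `mean_pool` is an illustrative helper, and real code would apply the same idea to the tensors returned by the model.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                totals[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [total / count for total in totals]


# Three token positions with 2-dimensional vectors; the last one is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```

Skipping the masked position matters: averaging in the padding vector would shift the result toward values that carry no meaning.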

## ChromaDB and Embedding Models

### Default Behavior
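ChromaDB uses a **default embedding model** (`all-MiniLM-L6-v2` at the time of writing) if you don't specify one:
```python
import chromadb
client = chromadb.Client()  # in-memory client
collection = client.create_collection(name="my_collection")
# Uses the default embedding function automatically
```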

### Custom Embedding Models
```python
from chromadb.utils import embedding_functions

# Option 1: Sentence Transformers
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Option 2: OpenAI
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-ada-002"
)

# Create collection with custom embedding
collection = client.create_collection(
    name="my_collection",
    embedding_function=sentence_transformer_ef
)
```
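
Conceptually, querying a collection means embedding the query text and returning the stored vectors closest to it. The sketch below does this by brute force over a small dictionary of made-up vectors. It is a stand-in to show the idea, not ChromaDB's implementation (vector stores typically rely on approximate nearest-neighbor indexes), and `top_k` and the document IDs are hypothetical names.

```python
import math


def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k stored vectors most similar to the query."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones come from the embedding function.
doc_vecs = {
    "doc_cat": [0.9, 0.1, 0.0],
    "doc_dog": [0.7, 0.7, 0.0],
    "doc_ml": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.0, 0.0], doc_vecs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```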

## Practical Example: Seeing Embeddings in Action

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example texts
texts = [
    "The cat is sleeping on the couch",
    "A feline is resting on the sofa",      # Similar meaning
    "The dog is barking loudly",            # Different meaning
    "Machine learning is a subset of AI"    # Completely different
]

# Generate embeddings
embeddings = model.encode(texts)
print(f"Shape of embeddings: {embeddings.shape}")  # (4, 384)

# Calculate pairwise cosine similarities (4x4 matrix, diagonal is 1.0)
similarities = cosine_similarity(embeddings)

print("Texts:")
for i, text in enumerate(texts):
    print(f"{i}: {text[:30]}...")

# Print each unique pair once, skipping self-comparisons
print("Pairwise similarity:")
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        print(f"Text {i} vs Text {j}: {similarities[i][j]:.3f}")
```

## Choosing the Right Embedding Model

### Consider These Factors:

1. **Domain**: General vs specialized (legal, medical, etc.)
2. **Performance**: Speed vs accuracy trade-offs
3. **Resources**: Memory and computational requirements
4. **Cost**: Free vs paid models
5. **Language**: Multi-language support needed?
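
The resource factor can be estimated before picking a model, since raw vector storage grows linearly with dimensionality. A rough sketch, assuming 32-bit floats and ignoring index overhead and metadata (`index_size_gib` is an illustrative helper):

```python
def index_size_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in GiB (no index structures or metadata)."""
    return num_vectors * dim * bytes_per_value / 2**30


for dim in (384, 768, 1536, 3072):
    size = index_size_gib(1_000_000, dim)
    print(f"{dim:>4} dims: {size:.2f} GiB per million vectors")
```

One million 384-dimensional vectors take roughly 1.4 GiB, and each doubling of the dimension doubles that figure.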

### Recommendations:

| Use Case | Recommended Model | Why |
|----------|------------------|-----|
| **Learning/Prototyping** | `all-MiniLM-L6-v2` | Fast, lightweight, good quality |
| **Production RAG** | `all-mpnet-base-v2` | Better quality, still manageable size |
| **High-end Applications** | OpenAI `text-embedding-ada-002` | High quality with no local compute; newer `text-embedding-3-*` models are also available |
| **Multilingual** | `paraphrase-multilingual-MiniLM-L12-v2` | Supports 50+ languages |

## Common Pitfalls to Avoid

1. **Mixing embedding models**: Always use the same model for indexing and querying
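2. **Ignoring context length**: Models have token limits (for example 256 for `all-MiniLM-L6-v2`, 384 for `all-mpnet-base-v2`, 8191 for `text-embedding-ada-002`); text past the limit is usually truncated or rejected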
3. **Not considering domain**: Generic models might not work well for specialized content
4. **Overlooking preprocessing**: Text cleaning can significantly impact quality
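
One common way to stay under a context limit is to split long documents into overlapping chunks before embedding them. The sketch below counts whitespace-separated words, which only approximates tokens; for an exact budget, count with the model's own tokenizer. The function name `chunk_words` and its default sizes are assumptions chosen for illustration.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words, sharing `overlap` words between neighbors."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, max_words=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary at least partly present in both neighboring chunks.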

## Testing Your Understanding

Try this exercise:
1. Take the sentence "The weather is beautiful today"
2. Generate embeddings using different models
3. Compare with "Today has lovely weather" and "The computer is broken"
4. Observe which pairs have higher similarity scores

## Next Steps

Now that you understand embedding models, we can:
1. Set up ChromaDB with a specific embedding model
2. Add our dummy documents to the collection
3. Perform semantic searches
4. Build a complete RAG pipeline

Remember: **The quality of your RAG system heavily depends on choosing the right embedding model for your use case!**

In [1]:
from dotenv import load_dotenv
import os
from google import genai

In [2]:
load_dotenv(override=True)
api_key = os.getenv("GOOGLE_API_KEY")

In [None]:
client = genai.Client(api_key=api_key)

result = client.models.embed_content(
    model="gemini-embedding-001",  # returns 3072 dimensions by default
    contents="What is the meaning of life?",
)

print(result.embeddings)
embedding_vector = result.embeddings[0]

[ContentEmbedding(
  values=[
    -0.022374554,
    -0.004560777,
    0.013309286,
    -0.0545072,
    -0.02090443,
    <... 3067 more items ...>,
  ]
)]

In [4]:
print(len(embedding_vector.values))

3072
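
The output shows `gemini-embedding-001` returning 3072 values per text, so these vectors cannot be compared with, say, 384-dimensional vectors from `all-MiniLM-L6-v2`. A minimal sketch of a guard against mixing models follows; `ensure_compatible` and the `index_meta` dictionary are hypothetical names, not part of any SDK.

```python
def ensure_compatible(index_meta: dict, query_model: str, query_vec: list) -> None:
    """Raise if a query vector was not produced by the model used to build the index."""
    if index_meta["model"] != query_model:
        raise ValueError(
            f"index built with {index_meta['model']!r}, query embedded with {query_model!r}"
        )
    if index_meta["dim"] != len(query_vec):
        raise ValueError(
            f"index expects {index_meta['dim']} dimensions, got {len(query_vec)}"
        )


# Metadata recorded once, at indexing time (values taken from the output above).
index_meta = {"model": "gemini-embedding-001", "dim": 3072}

ensure_compatible(index_meta, "gemini-embedding-001", [0.0] * 3072)  # passes silently

try:
    ensure_compatible(index_meta, "all-MiniLM-L6-v2", [0.0] * 384)
except ValueError as err:
    print(f"Rejected: {err}")
```

Checking the model name as well as the length matters because two different models can share a dimension, as `all-mpnet-base-v2` and `all-distilroberta-v1` do at 768.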
