
### **Day 3: Embeddings – The Heart of RAG Systems**


#### **What Are Embeddings and Why They Matter in RAG Systems?**

In our journey through Retrieval Augmented Generation (RAG), we've explored the fundamentals and document processing. Today, we dive into arguably the most important component: embeddings. These mathematical representations power the "retrieval" in RAG and determine whether your system finds relevant information or misses the mark entirely.

##### **What Are Embeddings?**

Embeddings are numerical representations of text (or other data) in the form of dense vectors, which are essentially lists of numbers. These vectors capture the semantic meaning of the text in a form that computers can compare.

Think of embeddings as translating human language into "computer language." When we convert words, sentences, or documents into embeddings, we create mathematical representations that preserve semantic relationships.

##### **The Mathematics Behind Embeddings**

At their core, embeddings map words or text to points in a high-dimensional space (typically hundreds or thousands of dimensions). Let's break this down with a simple example:

Imagine we want to represent three simple words with 3-dimensional embeddings (real embeddings use many more dimensions):

"King" = [0.7, 0.2, 0.1]

"Queen" = [0.6, 0.3, 0.1]

"Apple" = [0.1, 0.2, 0.8]

Notice how "King" and "Queen" have similar vectors because they're semantically related (both royalty), while "Apple" has a very different vector (it's an unrelated concept).

In practice, embedding models like those from OpenAI or Hugging Face use neural networks to produce these vectors. The computation involves:
1. Breaking text into tokens (words or subwords)
2. Passing these tokens through layers of a neural network
3. Extracting the final layer's numerical representations and, for sentence-level embeddings, pooling them (for example, by averaging) into a single vector

The exact computation involves matrix multiplications, non-linear activation functions, and other operations that capture patterns from massive training datasets of text.
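One common way to turn the per-token representations from step 3 into a single vector is mean pooling. The sketch below illustrates the idea with made-up token vectors; the values and the `mean_pool` helper are assumptions for illustration, not output from a real model.

```python
# Hypothetical per-token vectors, standing in for a network's final-layer output.
token_vectors = [
    [0.2, 0.4, 0.6],  # token 1
    [0.4, 0.0, 0.2],  # token 2
    [0.6, 0.2, 0.4],  # token 3
]

def mean_pool(vectors):
    """Average the token vectors dimension by dimension into one vector."""
    n = len(vectors)
    return [sum(dim_values) / n for dim_values in zip(*vectors)]

sentence_vector = mean_pool(token_vectors)
print(sentence_vector)  # one 3-dimensional vector for the whole text
```

Real models differ in how they pool (some use a special token's vector instead of an average), so treat this as one possible approach.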

##### **Why Use Embeddings?**

Embeddings solve a fundamental problem: computers don't naturally understand human language or concepts. By converting text to vectors, we can:

1. Measure semantic similarity between texts mathematically
2. Search by meaning rather than just keywords
3. Cluster similar concepts together
4. Find relevant information even when exact words don't match

##### **Real-world Example**

Imagine a user asks: "What's the treatment for high blood pressure?"

* **Without embeddings:** A simple keyword search might miss documents that talk about "hypertension treatment" or "reducing elevated blood pressure" because the exact words don't match.
* **With embeddings:** The system understands that "high blood pressure" and "hypertension" are semantically similar and retrieves relevant documents regardless of the exact terminology used.
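The contrast can be sketched in a few lines. The documents and the 3-dimensional vectors below are made-up toy values (a real system would get the vectors from an embedding model); the point is only that an exact-phrase search finds nothing while a vector comparison still ranks the hypertension document first.

```python
import math

phrase = "high blood pressure"
docs = [
    "Hypertension treatment options and lifestyle changes",
    "How to bake sourdough bread at home",
]

# Keyword search: keep documents that contain the exact phrase.
keyword_hits = [doc for doc in docs if phrase in doc.lower()]

# Toy 3-dimensional vectors (made-up values, not produced by a real model).
query_vec = [0.8, 0.5, 0.1]
doc_vecs = [
    [0.7, 0.6, 0.2],  # hypertension document
    [0.1, 0.2, 0.9],  # bread document
]

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

scores = [cosine_sim(query_vec, vec) for vec in doc_vecs]
best_doc = docs[scores.index(max(scores))]

print(keyword_hits)  # empty: neither document contains the exact phrase
print(best_doc)      # the hypertension document scores highest
```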

##### **Measuring Semantic Similarity**

The power of embeddings comes from our ability to mathematically measure how similar two pieces of text are. The two most common similarity metrics are:

##### **1. Cosine Similarity**

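Cosine similarity measures the cosine of the angle between two vectors. It ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction); a value near 0 means the vectors are roughly orthogonal, which usually indicates unrelated content.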

For two embedding vectors A and B, the cosine similarity is calculated as:

Cosine Similarity = (A · B) / (||A|| × ||B||)

Where:

* A · B is the dot product: sum(A[i] × B[i]) for all dimensions i
* ||A|| is the magnitude (length) of vector A: sqrt(sum(A[i]²))
* ||B|| is the magnitude (length) of vector B: sqrt(sum(B[i]²))

**Example calculation** with our simplified vectors:

* A = "King" = [0.7, 0.2, 0.1]
* B = "Queen" = [0.6, 0.3, 0.1]

Dot product: (0.7 × 0.6) + (0.2 × 0.3) + (0.1 × 0.1) = 0.42 + 0.06 + 0.01 = 0.49

||A|| = sqrt(0.7² + 0.2² + 0.1²) = sqrt(0.49 + 0.04 + 0.01) = sqrt(0.54) ≈ 0.735

||B|| = sqrt(0.6² + 0.3² + 0.1²) = sqrt(0.36 + 0.09 + 0.01) = sqrt(0.46) ≈ 0.678

Cosine Similarity = 0.49 / (0.735 × 0.678) ≈ 0.49 / 0.4984 ≈ 0.983

This high value (close to 1) indicates "King" and "Queen" are very similar in our embedding space.
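The same calculation can be checked in plain Python. This is a minimal sketch using the toy vectors from this section; the helper name `cosine_sim` is our own choice, not a library function.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king = [0.7, 0.2, 0.1]
queen = [0.6, 0.3, 0.1]
apple = [0.1, 0.2, 0.8]

print(round(cosine_sim(king, queen), 3))  # close to 1: very similar
print(round(cosine_sim(king, apple), 3))  # much lower: unrelated concept
```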


##### **2. Euclidean Distance**

Euclidean distance measures the straight-line distance between two points in space. Lower values indicate greater similarity, and a distance of 0 means the vectors are identical.

Euclidean Distance = sqrt(sum((A[i] - B[i])²))

For our example:

* √((0.7-0.6)² + (0.2-0.3)² + (0.1-0.1)²)
* √(0.01 + 0.01 + 0)
* √0.02 ≈ 0.141

This relatively small distance also indicates similarity between "King" and "Queen".
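Here is the same distance computed in plain Python, again as a small sketch over the toy vectors from this section (the function name is our own).

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

king = [0.7, 0.2, 0.1]
queen = [0.6, 0.3, 0.1]
apple = [0.1, 0.2, 0.8]

print(round(euclidean_distance(king, queen), 3))  # small distance: similar
print(round(euclidean_distance(king, apple), 3))  # larger distance: dissimilar
```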

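**Why cosine similarity is often preferred:** Cosine similarity depends only on the direction of the vectors, not their magnitude, which makes it less sensitive to differences in text length and more focused on content similarity. For vectors normalized to unit length, ranking by cosine similarity and ranking by Euclidean distance give the same order.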
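To see the difference between direction and magnitude, the sketch below scales a toy vector by 10 (a made-up stand-in for a much longer text about the same topic). Cosine similarity stays at 1, while Euclidean distance grows large.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [0.7, 0.2, 0.1]
long_doc = [x * 10 for x in short_doc]  # same direction, 10x the magnitude

cos = cosine_sim(short_doc, long_doc)
dist = euclidean_distance(short_doc, long_doc)

print(round(cos, 6))   # direction is unchanged, so cosine stays at 1
print(round(dist, 3))  # magnitude differs, so the distance is large
```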


##### **Types of Embeddings**

Not all embeddings are created equal. Here are the main types:

1. Word Embeddings

 * What they are: Represent individual words as vectors
 * Examples: Word2Vec, GloVe, FastText
 * Limitations: Don't capture context well; "bank" (financial) and "bank" (river) have the same embedding

 Word2Vec("bank") = [0.2, 0.5, -0.1, ...]

2. Contextual Embeddings

 * What they are: Represent words based on surrounding context
 * Examples: BERT, RoBERTa, T5
 * Advantages: Capture word meaning in context; "bank" has different embeddings in different contexts

 BERT("I deposited money at the bank") → "bank" = [0.4, 0.3, -0.2, ...]

 BERT("I sat by the river bank") → "bank" = [0.1, -0.2, 0.5, ...]

3. Sentence/Document-Level Embeddings

 * What they are: Represent entire sentences or documents as single vectors
 * Examples: Sentence-BERT, OpenAI embeddings, Universal Sentence Encoder
 * Advantages: Capture holistic meaning, efficient for retrieval

 OpenAI("Climate change is a global challenge") = [0.02, -0.15, 0.08, ...]

4. Bi-encoders vs. Cross-encoders

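 * Bi-encoders: Encode query and document separately, then compare the resulting vectors (fast, and document vectors can be precomputed, but typically less accurate)
 * Cross-encoders: Process query and document together and output a relevance score rather than a reusable embedding (typically more accurate but computationally expensive, so they are usually used to rerank a small set of candidates)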


##### **Popular Embedding Models**

| Model | Source | Best For | Dimensions |
|---|---|---|---|
| text-embedding-3-small | OpenAI | General-purpose, high-quality, fast | 1536 |
| all-MiniLM-L6-v2 | HuggingFace | Lightweight, quick search | 384 |
| BAAI/bge-small-en | HuggingFace | Great for English Q&A | 384 |
| Cohere/embed-english-v3.0 | Cohere | Semantic search and classification | 1024 |
| hkunlp/instructor-xl | HuggingFace | Instruction-aware embedding | 768 |

##### **How to Select an Embedding Model**

Consider these factors when choosing an embedding model:

1. **Performance:** How well does it capture semantic similarity for your specific domain?
2. **Speed:** How quickly can it generate embeddings? (Critical for real-time applications)
3. **Cost:** API-based models like OpenAI have usage costs; open-source models have computing costs
4. **Language support:** Does it work well for your target languages?
5. **Dimensions:** Higher dimensions often capture more nuance but require more storage

##### **Why Dimensions Matter**
Vector dimensions represent the "expressiveness" of the embedding space:
* Low dimensions (32-128): Faster, smaller storage, but less expressive
* Medium dimensions (384-768): Good balance for many applications
* High dimensions (1024+): More expressive, better for complex language understanding

Think of dimensions as the "vocabulary" of the embedding space. More dimensions allow the model to capture more nuanced relationships, just as a larger vocabulary allows more precise expression in human language.
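The storage side of this trade-off is simple arithmetic. The sketch below assumes each value is stored as a 32-bit float (4 bytes) and ignores any index overhead a vector database adds; the function name and the one-million-chunk corpus are illustrative assumptions.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats

def raw_vector_storage_mb(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of the vectors alone, in megabytes (10^6 bytes)."""
    return num_vectors * dimensions * bytes_per_value / 1_000_000

# Compare three common embedding sizes for a corpus of one million chunks.
for dims in (384, 768, 1536):
    print(dims, raw_vector_storage_mb(1_000_000, dims), "MB")
```

Going from 384 to 1536 dimensions quadruples the raw storage, which is worth weighing against any retrieval-quality gain you measure on your own data.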


##### **Practical Code to Generate Embeddings**

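```python
from sentence_transformers import SentenceTransformer

# Load a small sentence-embedding model (384-dimensional output)
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["Regenerative farming helps soil", "How to use fertilizer safely?"]

# encode() returns one embedding vector per input text
embeddings = model.encode(texts)

print(embeddings.shape)  # (number of texts, embedding dimension)
print(embeddings[0])     # Vector for first sentence
```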

##### **Semantic Similarity in Code**

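```python
from sklearn.metrics.pairwise import cosine_similarity

# Reuses `model`, `texts`, and `embeddings` from the previous cell
query = "Tips for healthy soil"
query_vec = model.encode([query])  # 2D array with a single row

# scores has one row (the query) and one column per document
scores = cosine_similarity(query_vec, embeddings)

# Pick the document with the highest similarity to the query
most_similar = texts[scores[0].argmax()]
print(most_similar)
```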
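A RAG pipeline usually passes several of the best-matching chunks to the LLM rather than only one. Below is a minimal top-k sketch in plain Python; the `top_k` helper, the third example text, and the scores are placeholders for illustration, not values produced by the model above.

```python
def top_k(texts, scores, k=2):
    """Return the k (text, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(texts, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Placeholder documents and similarity scores (made-up values).
example_texts = [
    "Regenerative farming helps soil",
    "How to use fertilizer safely?",
    "Crop rotation basics",
]
example_scores = [0.62, 0.31, 0.45]

for text, score in top_k(example_texts, example_scores, k=2):
    print(f"{score:.2f}  {text}")
```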

##### **Common Mistakes with Embeddings in RAG Systems**

1. **Using mismatched embeddings:** Ensure you use the SAME embedding model for both document indexing and queries
2. **Ignoring language diversity:** Some embedding models perform poorly on non-English content or specialized domains
3. **Overlooking chunking strategy:** The way you split documents affects embedding quality; chunks need sufficient context
4. **Failing to normalize vectors:** Some distance calculations require normalized vectors for proper comparison
5. **Using embedding models that don't match your use case:** A general-purpose model may underperform on highly specialized content (for example, legal or medical text); evaluate candidates on your own data before committing
6. **Ignoring dimensions:** Selecting models with too few dimensions can limit semantic expressiveness
7. **Not considering retrieval method:** Different vector databases and retrieval algorithms may work better with specific embedding types
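Mistake 4 is easy to illustrate. After L2 normalization every vector has length 1, so a plain dot product gives the same value as cosine similarity. The sketch below uses made-up 2-dimensional vectors, and `l2_normalize` is a helper name of our own.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# For unit-length vectors, the dot product equals the cosine similarity.
dot = sum(x * y for x, y in zip(a, b))
length_a = math.sqrt(sum(x * x for x in a))

print(round(length_a, 6))  # unit length after normalization
print(round(dot, 6))       # same value cosine similarity gives for the raw vectors
```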

##### **Why Embeddings Make or Break Your RAG System**

The quality of embeddings directly affects retrieval performance:

* Poor embeddings → Irrelevant documents retrieved → Inaccurate or hallucinated LLM responses
* Good embeddings → Relevant documents retrieved → Factual, grounded LLM responses

Think of embeddings as the "eyes" of your RAG system. If they can't properly see the semantic relationships between text, everything downstream will suffer.

##### **Conclusion**

Embeddings are the mathematical backbone that enables RAG systems to understand meaning and find relevant information. By transforming text into numerical vectors that preserve semantic relationships, embeddings allow computers to grasp relationships between concepts that would be impossible with simple keyword matching.

As you build RAG applications, investing time in selecting the right embedding model and properly configuring your embedding pipeline will pay enormous dividends in system accuracy and user satisfaction.

In our next session, we'll explore vector databases—specialized systems designed to efficiently store and search these embedding vectors at scale.