# **Text Similarity in RAG (Retrieval-Augmented Generation)**

## **What is Text Similarity?**
Text similarity measures **how semantically or structurally close two pieces of text are**. In RAG, it helps:
- Retrieve the most relevant documents for a query.
- Rank passages before feeding them to the LLM.

---

## **Why Do We Need Text Similarity in RAG?**
1. **Improves Retrieval Accuracy**  
   - Ensures the retrieved documents actually match the query’s meaning.
2. **Filters Irrelevant Content**  
   - Low similarity scores can exclude noisy or unrelated text.
3. **Optimizes LLM Input**  
   - Only the most similar (and useful) documents are passed to the LLM.
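
The three points above can be sketched as one small ranking step. This is a minimal sketch, not a production retriever: the helper name `rank_documents`, the `top_k` and `min_score` parameters, and the toy 3-dimensional vectors are all assumptions made for illustration. A real pipeline would get its embeddings from an embedding model.

```python
import numpy as np

def rank_documents(query_vec, doc_vecs, top_k=2, min_score=0.5):
    """Return (index, score) pairs for the best-matching documents.

    Hypothetical helper: keeps documents whose cosine similarity to the
    query is at least min_score, sorted best first, limited to top_k.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)

    # Cosine similarity equals the dot product of L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q

    # Sort by descending score, drop low-similarity documents, keep top_k.
    order = np.argsort(-scores)
    ranked = [(int(i), float(scores[i])) for i in order if scores[i] >= min_score]
    return ranked[:top_k]

# Toy embeddings, made up for this example.
query = [1.0, 0.0, 0.0]
docs = [
    [0.9, 0.1, 0.0],  # close to the query
    [0.0, 1.0, 0.0],  # orthogonal to the query, filtered out
    [0.7, 0.7, 0.0],  # partially related
]

results = rank_documents(query, docs)
for idx, score in results:
    print(f"doc {idx}: {score:.2f}")
```

Here the orthogonal document is dropped by the threshold, and the remaining two are returned in order of similarity. The threshold value is arbitrary and would need tuning for a given embedding model.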

---

## **Types of Text Similarity**

### **1. Cosine Similarity (Most Common in RAG)**
- Measures the **angle** between two vectors (usually word or sentence embeddings) and ignores their magnitudes.
- **Range:** `-1` (opposite direction) to `1` (same direction); `0` means the vectors are orthogonal.  
- **Best for:** Semantic similarity (e.g., comparing sentence embeddings).
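
For reference, the angle-based score described above is the standard cosine formula. The symbols $\mathbf{a}$ and $\mathbf{b}$ are placeholders chosen here for the query and document embeddings:

\[
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}
\]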

#### **Example:**
- **Query:** *"What is AI?"*  
- **Document:** *"Artificial Intelligence (AI) is machine intelligence."*  
- **Similarity:** about `0.92` (illustrative value; the actual score depends on the embedding model).

#### **Code (Python):**
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Toy 3-dimensional embeddings; real ones (e.g., from Sentence-BERT) have hundreds of dimensions
query_embedding = np.array([0.2, 0.8, 0.3])
doc_embedding = np.array([0.1, 0.7, 0.2])

similarity = cosine_similarity([query_embedding], [doc_embedding])[0][0]
print(f"Cosine Similarity: {similarity:.2f}")
```
**Output:**  
```
Cosine Similarity: 0.99
```

---

### **2. Jaccard Similarity (Set-Based)**
- Measures **overlap between word sets** (ignores order & meaning).  
- **Formula:**  
  \[
  J(A, B) = \frac{|A \cap B|}{|A \cup B|}
  \]
- **Range:** `0` (no overlap) to `1` (identical).  
- **Best for:** Keyword-based matching (e.g., exact word overlap).

#### **Example:**
- **Query:** *"How does photosynthesis work?"*  
- **Document:** *"Photosynthesis converts sunlight into energy."*  
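- **Shared words:** `{"photosynthesis"}`  
- **Similarity:** `1/8 = 0.125` (one shared token out of eight distinct tokens; low, despite the texts being semantically related).

#### **Code (Python):**
```python
def jaccard_similarity(text1, text2):
    """Jaccard similarity over lowercased, whitespace-separated tokens."""
    # Note: punctuation stays attached to words ("work?" != "work").
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    intersection = words1.intersection(words2)
    union = words1.union(words2)
    # Two empty texts have an empty union; return 0.0 instead of dividing by zero.
    return len(intersection) / len(union) if union else 0.0

query = "How does photosynthesis work?"
document = "Photosynthesis converts sunlight into energy."
print(f"Jaccard Similarity: {jaccard_similarity(query, document):.3f}")
```
**Output:**  
```
Jaccard Similarity: 0.125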
```
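
Because Jaccard works on raw token sets, the tokenizer matters a great deal. The sketch below compares whitespace splitting with a simple regex tokenizer that drops punctuation. The helper names `tokenize` and `jaccard` and the example query are assumptions made for illustration, not part of any library.

```python
import re

def tokenize(text):
    """Lowercase and keep only runs of letters/digits, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(set_a, set_b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

query = "What is photosynthesis?"
document = "Photosynthesis converts sunlight into energy."

# Whitespace splitting keeps "photosynthesis?" and "photosynthesis" distinct.
naive = jaccard(set(query.lower().split()), set(document.lower().split()))
# Regex tokenization strips the "?" so the shared word is found.
cleaned = jaccard(tokenize(query), tokenize(document))

print(f"whitespace tokens: {naive:.3f}")   # 0.000
print(f"regex tokens:      {cleaned:.3f}") # 0.143
```

With whitespace tokens the two texts share nothing, while the regex tokenizer finds one shared word out of seven distinct tokens. Either way the score stays low, which is the limitation noted above: word overlap does not capture meaning.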

---

### **3. Euclidean Distance (L2 Distance)**
- Measures **straight-line distance** between vectors.  
- **Lower values = more similar**.  
- **Best for:** Clustering, but less common in RAG (sensitive to vector scale).

#### **Example:**
- **Query Embedding:** `[0.1, 0.5]`  
- **Doc Embedding:** `[0.2, 0.6]`  
- **Distance:** `√[(0.1-0.2)² + (0.5-0.6)²] = √0.02 ≈ 0.14` (small distance = high similarity).

#### **Code (Python):**
```python
import numpy as np
from scipy.spatial import distance

query_embedding = np.array([0.1, 0.5])
doc_embedding = np.array([0.2, 0.6])

euclidean_dist = distance.euclidean(query_embedding, doc_embedding)
print(f"Euclidean Distance: {euclidean_dist:.2f}")
```
**Output:**  
```
Euclidean Distance: 0.14
```
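
The scale sensitivity mentioned above can be seen directly. The sketch below, using made-up vectors and hypothetical helper names (`cosine`, `l2_normalize`), compares a vector with a scaled copy of itself, then checks the identity that holds for unit vectors: squared distance equals `2 - 2 * cosine`.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

a = np.array([0.1, 0.5])
b = 10 * a  # same direction, ten times the magnitude

raw_dist = float(np.linalg.norm(a - b))                                # large
unit_dist = float(np.linalg.norm(l2_normalize(a) - l2_normalize(b)))   # ~0
same_direction = cosine(a, b)                                          # ~1

# For unit vectors, squared Euclidean distance = 2 - 2 * cosine similarity.
u = l2_normalize(np.array([0.1, 0.5]))
v = l2_normalize(np.array([0.6, 0.2]))
lhs = float(np.linalg.norm(u - v) ** 2)
rhs = 2 - 2 * cosine(u, v)

print(f"raw distance: {raw_dist:.2f}, normalized distance: {unit_dist:.2f}")
print(f"cosine: {same_direction:.2f}, identity gap: {abs(lhs - rhs):.1e}")
```

Because of this identity, ranking by Euclidean distance on L2-normalized vectors gives the same order as ranking by cosine similarity.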

---

### **4. BERTScore (Context-Aware)**
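- Uses contextual token embeddings from BERT-style models and matches tokens by cosine similarity to compare **contextual similarity**.  
- **Range:** typically `0` to `1` in practice; it tends to correlate with human judgment better than n-gram overlap metrics.  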
- **Best for:** Evaluating LLM outputs or high-precision retrieval.

#### **Code (Python):**
```python
from bert_score import score

query = "What is AI?"
document = "Artificial Intelligence mimics human cognition."

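# score(candidates, references) returns precision, recall, and F1 tensors.
# The first call downloads a pretrained model.
P, R, F1 = score([query], [document], lang="en")
print(f"BERTScore F1: {F1.mean().item():.2f}")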
```
**Output (illustrative; the exact value depends on the model and library version):**  
```
BERTScore F1: 0.85
```
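
To make the idea behind the score more concrete, here is a simplified sketch of greedy token matching on toy vectors. It is not the `bert_score` implementation: it omits IDF weighting and baseline rescaling, the function name `greedy_match_scores` is hypothetical, and the 2-D "token embeddings" are made up.

```python
import numpy as np

def greedy_match_scores(cand_tokens, ref_tokens):
    """Precision, recall, and F1 from greedy token matching on cosine similarity.

    cand_tokens, ref_tokens: 2-D arrays with one row per token embedding.
    Simplified sketch: no IDF weighting and no baseline rescaling.
    """
    c = cand_tokens / np.linalg.norm(cand_tokens, axis=1, keepdims=True)
    r = ref_tokens / np.linalg.norm(ref_tokens, axis=1, keepdims=True)
    sim = c @ r.T  # sim[i, j] = cosine(candidate token i, reference token j)

    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

# Made-up 2-D "token embeddings", for illustration only.
candidate = np.array([[1.0, 0.0], [0.0, 1.0]])
reference = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

p, r, f1 = greedy_match_scores(candidate, reference)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Every candidate token has an exact match in the reference, so precision is 1.0. One reference token has no exact counterpart, so recall is lower. Running a transformer to produce the real token embeddings is what makes the full metric expensive.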

---

## **Which Similarity Measure to Use in RAG?**
| **Method**          | **Pros**                          | **Cons**                          | **Use Case**                     |
|---------------------|-----------------------------------|-----------------------------------|----------------------------------|
| **Cosine**          | Works well with embeddings        | Needs vectorization               | General retrieval (e.g., FAISS)  |
| **Jaccard**         | Simple, no ML needed              | Ignores semantics                | Keyword search                   |
| **Euclidean**       | Intuitive distance metric         | Sensitive to vector scale        | Clustering                       |
| **BERTScore**       | Context-aware, high accuracy      | Computationally heavy            | Evaluating LLM outputs           |

---

## **Key Takeaways**
1. **Cosine similarity** is the **usual default** for RAG retrieval (in FAISS, it is typically obtained by L2-normalizing the vectors and searching an inner-product index).  
2. **Jaccard** is fast but ignores meaning.  
3. **BERTScore** is ideal for evaluating quality but too slow for real-time retrieval.  
4. **Normalize vectors** (L2) before using Euclidean distance; on unit vectors it produces the same ranking as cosine similarity.  

Would you like a **hybrid approach** (e.g., Jaccard + Cosine) for better retrieval? 🚀