<a href="https://colab.research.google.com/github/ArkS0001/Improve-RAGs/blob/main/Dense%2BSparse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dense Embeddings (Using SentenceTransformers)

Dense embeddings map each text to a fixed-length vector using a deep learning model, so texts with similar meaning end up close together even when they share no words.

Pros of Dense Embeddings

✅ Captures semantic meaning
✅ Works well for general-purpose retrieval

Cons of Dense Embeddings

❌ Can miss exact keyword matches
❌ Retrieval quality depends on the embedding model

In [1]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a dense embedding model (any SentenceTransformers model name works here)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example query and documents
query = "best laptop for coding"
documents = ["MacBook Pro for developers", "Budget laptop with good battery", "Gaming laptop with high performance"]

# Convert query and documents to embeddings
query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)

# Compute cosine similarity; the result has shape (1, number of documents)
similarities = cosine_similarity([query_embedding], doc_embeddings)
best_match_index = np.argmax(similarities[0])

print(f"Best match: {documents[best_match_index]} (Score: {similarities[0][best_match_index]:.4f})")


Best match: Budget laptop with good battery (Score: 0.6108)
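
To make the score above less of a black box, here is a minimal sketch of cosine similarity written directly in NumPy. The helper name `cosine_sim` and the toy vectors are illustrative assumptions, not part of the notebook or of scikit-learn; real embeddings from the model have many more dimensions.

```python
import numpy as np

def cosine_sim(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Dot product of the query with every document vector
    dots = doc_matrix @ query_vec
    # Product of the vector lengths (L2 norms)
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms

# Toy 3-dimensional vectors (hypothetical values, not real model output)
toy_query = [1.0, 0.0, 1.0]
toy_doc_vectors = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_cosine = cosine_sim(toy_query, toy_doc_vectors)
print(toy_cosine)
```

A vector pointing in the same direction as the query scores 1, an orthogonal one scores 0, and vector length does not matter. That is why cosine similarity is a common choice for comparing embeddings.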


# Sparse Embeddings (Using BM25 from Rank-BM25)

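Sparse embeddings rely on keyword-based retrieval. BM25 scores a document by the query terms it contains, giving more weight to terms that occur often in the document and rarely in the rest of the collection.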

Pros of Sparse Embeddings

✅ Matches exact keywords
✅ No deep learning model required

Cons of Sparse Embeddings

❌ Lacks semantic understanding
❌ Can’t match synonyms or contextual meanings
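
As a rough illustration of how keyword-based scoring works, the sketch below implements a textbook Okapi BM25 variant in plain Python. It is an assumption-laden approximation, not the code inside `rank_bm25`: the function name `bm25_sketch`, the smoothed `log(1 + ...)` IDF, and whitespace tokenization are choices made here for clarity, so the numbers will differ from the library's.

```python
import math
from collections import Counter

def bm25_sketch(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with a smoothed Okapi BM25 variant."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))
    results = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        total = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            n_t = doc_freq[term]
            # Smoothed IDF (assumption): log(1 + ...) keeps the value non-negative
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf * (k1 + 1) / denom
        results.append(total)
    return results

bm25_toy_corpus = [text.lower().split() for text in [
    "MacBook Pro for developers",
    "Budget laptop with good battery",
    "Gaming laptop with high performance",
]]
bm25_toy_scores = bm25_sketch("best laptop for coding".split(), bm25_toy_corpus)
print(bm25_toy_scores)
```

With these inputs only "for" and "laptop" occur in any document, so the ranking is driven entirely by those two tokens. "best" and "coding" contribute nothing because no document contains them, which is the lack of semantic understanding listed above.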

In [8]:
%pip install rank_bm25




In [12]:
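import nltk
# word_tokenize needs the 'punkt_tab' resource in recent NLTK releases (the log below shows it being downloaded)
nltk.download('punkt_tab')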




[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [13]:
import numpy as np
from nltk.tokenize import word_tokenize
from rank_bm25 import BM25Okapi

# Example documents
documents = [
    "MacBook Pro for developers",
    "Budget laptop with good battery",
    "Gaming laptop with high performance"
]

# Tokenize documents
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]

# Initialize BM25
bm25 = BM25Okapi(tokenized_docs)

# Query processing
query = "best laptop for coding"
tokenized_query = word_tokenize(query.lower())

# Get BM25 scores
scores = bm25.get_scores(tokenized_query)
best_match_index = np.argmax(scores)

print(f"Best match: {documents[best_match_index]} (Score: {scores[best_match_index]:.4f})")


Best match: MacBook Pro for developers (Score: 0.5459)


# Hybrid Search (Combining Dense + Sparse)

Hybrid search improves retrieval by combining dense (semantic) and sparse (keyword-based) methods.

Why Hybrid Search?

✅ Combines keyword matching and semantic understanding
✅ Often more accurate and robust for RAG retrieval than either method alone
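
One simple way to combine the two signals, and the one the next cell uses, is a weighted sum of scores that have each been divided by their maximum. The symbols below are notation introduced here for illustration: $s_{\text{dense}}(d)$ and $s_{\text{sparse}}(d)$ are the cosine and BM25 scores of document $d$, and $\alpha$ is the weight on the dense score (0.5 in the cell).

$$
s_{\text{hybrid}}(d) = \alpha \, \frac{s_{\text{dense}}(d)}{\max_{d'} s_{\text{dense}}(d')} + (1 - \alpha) \, \frac{s_{\text{sparse}}(d)}{\max_{d'} s_{\text{sparse}}(d')}
$$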

In [14]:
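# Normalize scores (uses `similarities` and `scores` from the cells above)
dense_score = similarities[0] / np.max(similarities)  # Scale cosine similarity by its maximum
sparse_max = np.max(scores)
# Guard against division by zero when no document shares a term with the query
sparse_score = scores / sparse_max if sparse_max > 0 else np.zeros_like(scores)

# Weighted sum of both scores (equal weight on dense and sparse)
hybrid_score = 0.5 * dense_score + 0.5 * sparse_score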

best_match_index = np.argmax(hybrid_score)
print(f"Best match (Hybrid): {documents[best_match_index]} (Score: {hybrid_score[best_match_index]:.4f})")


Best match (Hybrid): MacBook Pro for developers (Score: 0.9325)
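
The 0.5/0.5 split above is one choice among many. As a hedged sketch, the same fusion can be wrapped in a small helper with an adjustable weight. The name `fuse_scores`, the `alpha` parameter, and the example numbers are assumptions for illustration; they are not taken from the cells above.

```python
import numpy as np

def fuse_scores(dense, sparse, alpha=0.5):
    """Weighted sum of max-normalized dense and sparse scores.

    alpha is the weight on the dense scores; (1 - alpha) goes to the sparse scores.
    """
    dense = np.asarray(dense, dtype=float)
    sparse = np.asarray(sparse, dtype=float)

    def scale(values):
        top = values.max()
        # Avoid dividing by zero when no document scored above zero
        return values / top if top > 0 else np.zeros_like(values)

    return alpha * scale(dense) + (1 - alpha) * scale(sparse)

# Hypothetical scores for three documents (illustrative values only)
example_dense = [0.2, 0.8, 0.4]
example_sparse = [3.0, 0.0, 1.5]
for alpha in (0.0, 0.5, 1.0):
    fused = fuse_scores(example_dense, example_sparse, alpha)
    print(alpha, fused, int(np.argmax(fused)))
```

Setting `alpha` to 0 reproduces the sparse ranking and setting it to 1 reproduces the dense ranking. Values in between trade one off against the other, and the best setting is usually found by evaluating on queries with known relevant documents.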
