### Manual Sparse vs Dense Search Demo

This notebook demonstrates the basic ideas behind keyword (sparse) search using TF-IDF
and a small dense-embedding example. It shows preprocessing, building TF-IDF sparse
representations, computing cosine similarities, and ranking results.

In [None]:
# Import libraries used for both sparse (TF-IDF) and dense (array) examples
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Note: scikit-learn provides simple utilities for small demos like this.

In [None]:
# Sample documents (small toy corpus)
documents = [
    "This is a list containing sample documents.",
    "Keywords are important for keyword-based search.",
    "Document analysis involves extracting keywords.",
    "Keyword-based search relies on sparse embeddings."
]

# The examples below will show a simple flow: preprocess -> vectorize -> compare -> rank.

In [None]:
# Example user query
query = "keyword-based search"

In [None]:
import re

def preprocess_text(text):
    """Simple preprocessing: lowercasing and removing punctuation."""
    # Convert text to lowercase to make matching case-insensitive
    text = text.lower()
    # Remove punctuation (keep word characters and whitespace).
    # Note: hyphens are dropped too, so "keyword-based" becomes "keywordbased".
    text = re.sub(r'[^\w\s]', '', text)
    return text

# This is intentionally minimal; real pipelines may include tokenization, stemming,
# stopword removal, or more advanced normalization depending on needs.
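
The comment above mentions stopword removal and further normalization. A minimal sketch of one such variant follows. The name `preprocess_text_extended` and the tiny `STOPWORDS` set are illustrative assumptions, not part of the original notebook. Unlike `preprocess_text`, this version replaces punctuation with a space, so "keyword-based" is split into two tokens instead of being fused into "keywordbased".

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "for", "on", "which", "this"}

def preprocess_text_extended(text):
    """Lowercase, split on punctuation, and drop stopwords (illustrative sketch)."""
    text = text.lower()
    # Replace punctuation with a space so hyphenated words are split, not fused
    text = re.sub(r"[^\w\s]", " ", text)
    # Whitespace tokenization, then filter out the stopwords
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

print(preprocess_text_extended("Keywords are important for keyword-based search."))
```

Whether to split or fuse hyphenated terms is a design choice: splitting lets a query for "keyword" match "keyword-based", while fusing keeps the compound as a single, more specific term.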

In [None]:
# Preprocess all documents (apply same function to each document)
preprocess_documents = [preprocess_text(doc) for doc in documents]

In [None]:
# Show the preprocessed documents so you can inspect what TF-IDF will see
preprocess_documents

In [None]:
print("Preprocessed Documents:")
for doc in preprocess_documents:
    print(doc)

# This explicit print helps when running cells interactively to confirm preprocessing.

In [None]:
print("Original Query (before preprocessing):")
print(query)

In [None]:
# Preprocess the query the same way as documents so feature space matches
preprocessed_query = preprocess_text(query)

In [None]:
# Show preprocessed query
preprocessed_query

In [None]:
# Initialize TF-IDF vectorizer (uses token counts weighted by inverse doc frequency)
vector = TfidfVectorizer()

In [None]:
# Fit the vectorizer on the corpus and transform documents into sparse TF-IDF matrix
X = vector.fit_transform(preprocess_documents)

# X is a scipy sparse matrix (documents x features). For large corpora we generally keep it sparse.
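
To make the weighting less of a black box, here is a minimal pure-Python sketch of the same computation. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, L2 normalization, tokens of two or more word characters). The helper name `manual_tfidf` and the two sample strings are hypothetical, and the result is a list of `{term: weight}` dicts rather than a scipy sparse matrix.

```python
import math
import re
from collections import Counter

def manual_tfidf(docs):
    """Compute L2-normalized TF-IDF weights per document as {term: weight} dicts."""
    # Tokens of two or more word characters, like TfidfVectorizer's default pattern
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize so every non-empty document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors, idf

# Hypothetical, already-preprocessed sample strings
sample_docs = [
    "keywords are important for keywordbased search",
    "keywordbased search relies on sparse embeddings",
]
vectors, idf = manual_tfidf(sample_docs)
for vec in vectors:
    print({term: round(w, 3) for term, w in vec.items()})
```

Storing only the non-zero terms per document is the same idea a sparse matrix uses: most vocabulary terms are absent from any given document, so their zero weights are never materialized.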

In [None]:
# Convert sparse matrix to a dense NumPy array only for small demos / inspection
dense_X = X.toarray()
dense_X

# Warning: converting to dense does not scale to large datasets. Keep sparse matrices for production.

In [None]:
# Transform the preprocessed query into the same TF-IDF feature space
query_embedding = vector.transform([preprocessed_query])

# query_embedding is also sparse (1 x n_features)

In [None]:
# If you want the dense array representation of the query embedding:
# query_embedding.toarray()  # uncomment to inspect as dense array

In [None]:
# Explicitly convert the sparse query embedding to a NumPy array for compatibility with some APIs
query_embedding.toarray()

In [None]:
# Compute cosine similarity between each document and the query
# cosine_similarity accepts sparse or dense inputs; here X is (n_docs x n_features)
# and query_embedding is (1 x n_features)
similarities = cosine_similarity(X, query_embedding)
similarities

# similarities is a column vector (n_docs x 1) containing similarity scores.
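
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out for a single pair of vectors. The function name `cosine` and the example vectors are illustrative, and returning 0.0 for an all-zero vector is a choice made here to avoid division by zero.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal, no shared terms
print(cosine([1.0, 1.0], [1.0, 0.0]))  # partial overlap
```

Because TF-IDF weights are non-negative, scores for this kind of vector fall between 0 (no shared terms) and 1 (same direction).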

In [None]:
# Inspect the sort order (argsort returns indices that would sort the array)
np.argsort(similarities, axis=0)

In [None]:
# Ranking: argsort gives ascending order, so reverse to get descending similarity
ranked_indices = np.argsort(similarities, axis=0)[::-1].flatten()

# ranked_indices now contains document indices ordered from most to least similar
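
One detail worth knowing about the reverse-after-argsort pattern is how it treats tied scores: reversing an ascending order also reverses the relative order of ties. A hedged alternative is to sort the negated scores with a stable sort, which keeps tied documents in their original index order. The scores below are hypothetical and only shaped like the `cosine_similarity` output.

```python
import numpy as np

# Hypothetical scores shaped like cosine_similarity output (n_docs x 1)
scores = np.array([[0.0], [0.5], [0.2], [0.5]])

# Approach used above: ascending argsort, then reverse
reversed_order = np.argsort(scores, axis=0)[::-1].flatten()

# Alternative: sort negated scores with a stable sort so ties keep index order
stable_order = np.argsort(-scores.ravel(), kind="stable")

print(reversed_order)
print(stable_order)
```

Both orderings rank by descending score; they can differ only in where documents 1 and 3, which share a score, end up relative to each other.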

In [None]:
# Map indices back to the original document text
ranked_documents = [documents[i] for i in ranked_indices]
ranked_indices

In [None]:
# Output the ranked documents with their rank
for i, doc in enumerate(ranked_documents):
    print(f"Rank {i+1}: {doc}")

# In production you'd likely return ids + scores rather than printing text.
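
As a rough illustration of returning ids and scores instead of printing text, the sketch below pairs each of the top results with its score. The helper `top_k`, the id strings, and the scores are all hypothetical; in the notebook you would pass the `similarities` array and your own document ids.

```python
import numpy as np

def top_k(similarities, doc_ids, k=3):
    """Return the k best (doc_id, score) pairs, highest score first."""
    scores = np.asarray(similarities, dtype=float).ravel()
    # Negate so argsort yields descending order; stable keeps ties in index order
    order = np.argsort(-scores, kind="stable")[:k]
    return [(doc_ids[i], float(scores[i])) for i in order]

# Hypothetical ids and scores, shaped like cosine_similarity output (n_docs x 1)
example_ids = ["doc-0", "doc-1", "doc-2", "doc-3"]
example_scores = np.array([[0.0], [0.41], [0.0], [0.37]])
print(top_k(example_scores, example_ids, k=2))
```

Returning scores alongside ids lets the caller apply a relevance threshold or merge results from several retrievers.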

In [None]:
# Show the original query for reference
query

---

### Dense embedding example (toy data)

The following section uses small, hand-crafted dense vectors to show how cosine similarity
and ranking work with dense embeddings (e.g., the outputs of a sentence-transformers model).
The vectors are illustrative only; in a real application, generate embeddings with a trained
model such as one from the sentence-transformers library.

In [None]:
# Toy documents (re-declared for clarity)
documents = [
    "This is a list containing sample documents.",
    "Keywords are important for keyword-based search.",
    "Document analysis involves extracting keywords.",
    "Keyword-based search relies on sparse embeddings."
]

In [None]:
# Reference: use an embedding model like sentence-transformers in real applications
# https://huggingface.co/sentence-transformers

In [None]:
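# Small toy dense embeddings (each row is an embedding for a document).
# Note: only three rows are defined here although `documents` has four entries,
# so the ranking below refers to embedding rows, not to specific document texts.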
document_embeddings = np.array([
    [0.634, 0.234, 0.867, 0.042, 0.249],
    [0.123, 0.456, 0.789, 0.321, 0.654],
    [0.987, 0.654, 0.321, 0.123, 0.456]
])

In [None]:
# Toy query represented as a dense vector (1 x dim)
query_embedding = np.array([[0.789, 0.321, 0.654, 0.987, 0.123]])

In [None]:
# Compute cosine similarity between dense query and dense document embeddings
similarities = cosine_similarity(document_embeddings, query_embedding)
similarities

# The resulting array gives similarity per document; higher is more similar.
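
The same scores can be reproduced by hand, which may help show what `cosine_similarity` is doing for dense inputs: scale each vector to unit length, then take a matrix product. This is a sketch using the toy vectors from the cells above; the names `doc_vecs`, `query_vec`, and `l2_normalize` are introduced here for illustration.

```python
import numpy as np

# Toy vectors copied from the cells above
doc_vecs = np.array([
    [0.634, 0.234, 0.867, 0.042, 0.249],
    [0.123, 0.456, 0.789, 0.321, 0.654],
    [0.987, 0.654, 0.321, 0.123, 0.456],
])
query_vec = np.array([[0.789, 0.321, 0.654, 0.987, 0.123]])

def l2_normalize(matrix):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)

# After normalization, cosine similarity is just a matrix product
manual_scores = l2_normalize(doc_vecs) @ l2_normalize(query_vec).T
print(manual_scores.shape)
print(manual_scores.ravel())
```

Normalizing once and reusing the unit-length document vectors is a common design choice, since each new query then costs only a single matrix product.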

In [None]:
# Rank indices by similarity (descending)
ranked_indices = np.argsort(similarities, axis=0)[::-1].flatten()
ranked_indices

# You can pair these indices with original document texts or IDs for retrieval.

In [None]:
# Output the ranked documents from the dense example
for i, idx in enumerate(ranked_indices):
    print(f"Rank {i+1}: Document {idx+1}")

# End of notebook: this simple demo shows the two common building blocks used in
# hybrid search (sparse TF-IDF for keywords and dense embeddings for semantic similarity).
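
Since the closing note mentions hybrid search, here is one simple way the two score lists could be combined: rescale each to [0, 1] and take a weighted sum. This is only a sketch of one possible fusion scheme, not something the notebook itself implements. The helpers `min_max` and `hybrid_scores`, the weight `alpha`, and the score values are all hypothetical.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    scores = np.asarray(scores, dtype=float).ravel()
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span else np.zeros_like(scores)

def hybrid_scores(sparse_scores, dense_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the weight on the dense side."""
    return alpha * min_max(dense_scores) + (1 - alpha) * min_max(sparse_scores)

# Hypothetical scores for the same three documents from each retriever
sparse = [0.0, 0.4, 0.3]
dense = [0.9, 0.5, 0.8]
combined = hybrid_scores(sparse, dense, alpha=0.3)

# Rank documents by combined score, highest first
print(np.argsort(-combined, kind="stable"))
```

Rescaling first matters because sparse and dense scores typically occupy different ranges, so a raw sum would let one retriever dominate.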