<a href="https://colab.research.google.com/github/Krishishah7/nlp-learning-series/blob/main/03_search/tfidf_search_vs_semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install sentence-transformers scikit-learn pandas

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

In [None]:
documents = [
    "Machine learning improves predictive models",
    "Artificial intelligence is transforming industries",
    "Natural language processing works with text data",
    "Python is widely used in data science",
    "Cooking food is a creative hobby"
]

In [None]:
query = "AI and machine learning"

In [None]:
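# Fit the vocabulary and IDF weights on the documents only
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
# Query terms missing from the fitted vocabulary (here "AI" and "and") are ignored
query_tfidf = tfidf_vectorizer.transform([query])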

In [None]:
# Result has shape (1, n_documents); [0] selects the row for the single query
tfidf_scores = cosine_similarity(query_tfidf, tfidf_matrix)[0]

In [None]:
tfidf_results = pd.DataFrame({
    "Document": documents,
    "TF-IDF Similarity": tfidf_scores
}).sort_values(by="TF-IDF Similarity", ascending=False)

tfidf_results

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
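# One dense embedding per document
doc_embeddings = model.encode(documents)
# Wrap the query in a list so the result is 2D, as cosine_similarity expects
query_embedding = model.encode([query])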

In [None]:
semantic_scores = cosine_similarity(query_embedding, doc_embeddings)[0]

In [None]:
semantic_results = pd.DataFrame({
    "Document": documents,
    "Semantic Similarity": semantic_scores
}).sort_values(by="Semantic Similarity", ascending=False)

semantic_results

In [13]:
comparison = pd.DataFrame({
    "Document": documents,
    "TF-IDF Score": tfidf_scores,
    "Semantic Score": semantic_scores
}).sort_values(by="Semantic Score", ascending=False)

comparison

| | Document | TF-IDF Score | Semantic Score |
|---|---|---|---|
| 0 | Machine learning improves predictive models | 0.632456 | 0.587651 |
| 1 | Artificial intelligence is transforming indust... | 0.0 | 0.555996 |
| 2 | Natural language processing works with text data | 0.0 | 0.274609 |
| 3 | Python is widely used in data science | 0.0 | 0.20806 |
| 4 | Cooking food is a creative hobby | 0.0 | 0.176244 |


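- This notebook compares traditional TF-IDF search with semantic search based on sentence embeddings (`all-MiniLM-L6-v2`).
- TF-IDF scores a document by its keyword overlap with the query, while semantic search compares embeddings and can therefore match related meaning without shared words.
- For the query "AI and machine learning", TF-IDF gives a nonzero score (0.632) only to the document that contains "machine" and "learning"; the other four documents all score 0.0 and cannot be ranked against each other. Semantic search still ranks "Artificial intelligence is transforming industries" second (0.556), even though it shares no keyword with the query.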
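
To make the keyword-overlap behaviour concrete, here is a minimal pure-Python sketch that recomputes the TF-IDF cosine scores by hand. It assumes the default `TfidfVectorizer` settings (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalisation); the helper names `tokenize`, `tfidf_vector`, and `cosine` are illustrative and not part of the notebook or of scikit-learn.

```python
import math
import re
from collections import Counter

documents = [
    "Machine learning improves predictive models",
    "Artificial intelligence is transforming industries",
    "Natural language processing works with text data",
    "Python is widely used in data science",
    "Cooking food is a creative hobby",
]
query = "AI and machine learning"

# Assumed to mirror the default tokenizer: lowercase, tokens of 2+ word characters
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


doc_tokens = [tokenize(d) for d in documents]
n_docs = len(documents)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in doc_tokens for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(tokens):
    """Return an L2-normalised sparse TF-IDF vector; out-of-vocabulary terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    """Dot product of two already-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


query_tokens = tokenize(query)
query_vec = tfidf_vector(query_tokens)
scores = [cosine(query_vec, tfidf_vector(tokens)) for tokens in doc_tokens]

for doc, tokens, score in zip(documents, doc_tokens, scores):
    overlap = sorted(set(query_tokens) & set(tokens))
    print(f"{score:.6f}  shared terms: {overlap}  |  {doc}")
```

Only the first document shares any term with the query ("machine" and "learning"). Its five terms each occur in that document alone, so they carry equal weight, and the score reduces to 2 / sqrt(2 * 5) ≈ 0.632456, which matches the TF-IDF Score column. With no shared terms, every other document scores exactly 0.0, regardless of how related its meaning is.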
