An educational, hands-on repository to learn and compare two core information retrieval approaches:
- Keyword search with inverted indexes and BM25
- Semantic search with dense embeddings and cosine similarity
Use this repo as a quick revision sheet and as a runnable playground built around a small movie-metadata dataset. What you'll learn:
- The difference between keyword and semantic search
- How TF, IDF, TF-IDF, and BM25 work and how to inspect them
- How sentence-transformer embeddings power semantic similarity search
- How to run both systems end-to-end from the command line
- `cli/keyword_search_cli.py`: Commands for building/loading an inverted index and running BM25/TF/IDF queries
- `cli/semantic_search_cli.py`: Commands for generating embeddings and running semantic search
- `cli/lib/keyword_search.py`: Inverted index, TF/IDF, BM25 scoring and search
- `cli/lib/semantic_search.py`: Embedding generation, caching, and cosine similarity search
- `utils/keyword_seach_utils/`: Preprocessing and BM25 constants/utilities
- `utils/semantic_search_utils/`: Preprocessing and vector operations (cosine similarity)
- `data/movies.json`: Sample dataset used by both CLIs
- `cache/`: Auto-created; stores built indexes and embeddings
- `pyproject.toml`: Dependencies (Python ≥ 3.13)
- Builds an inverted index mapping terms → document IDs
- Uses exact/normalized token matching (after preprocessing)
- Scoring (a runnable sketch follows below):
  - TF (term frequency): occurrences of a token in a document
  - IDF (inverse document frequency): how rare a token is across documents. A common form: `idf(t) = ln((N + 1) / (df(t) + 1))`
  - TF-IDF: `tf × idf`
  - BM25: a strong ranking function that saturates TF and normalizes for document length. Uses tunable `k1` (term-frequency saturation) and `b` (length normalization).
Pros: transparent, fast, and simple. Cons: brittle to paraphrasing and synonyms.
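A minimal, self-contained sketch of these ideas with toy documents and a whitespace tokenizer (illustrative only, not the repo's actual implementation):

```python
import math
from collections import defaultdict

docs = {
    1: "a space adventure on a distant planet",
    2: "a quiet drama about family and memory",
    3: "astronauts race to rescue a stranded space crew",
}

def tokenize(text: str) -> list[str]:
    # Lowercase and split on whitespace; the repo's preprocessing is richer.
    return text.lower().split()

# Build the inverted index: term -> set of document IDs.
index: dict[str, set[int]] = defaultdict(set)
tf: dict[int, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token].add(doc_id)
        tf[doc_id][token] += 1

def idf(term: str) -> float:
    # Smoothed IDF, matching the form shown above.
    n, df = len(docs), len(index.get(term, ()))
    return math.log((n + 1) / (df + 1))

# Simple keyword search: documents containing any query token.
query_tokens = tokenize("space adventure")
matches = set().union(*(index.get(t, set()) for t in query_tokens))
print(matches)                      # {1, 3}
print(idf("space"), idf("family")) # the rarer term gets the larger IDF
```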
- Converts text into dense vectors (embeddings) using `sentence-transformers` (here: `all-MiniLM-L6-v2`)
- Measures similarity via cosine similarity in vector space
- Captures meaning beyond exact words (synonyms, paraphrases, context)
Pros: robust to wording; understands semantics. Cons: needs a model and embeddings; less transparent.
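A rough end-to-end sketch of that pipeline, assuming `sentence-transformers` and `numpy` are installed (the model name matches the one used here; the documents are made up):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Astronauts race to rescue a stranded space crew.",
    "A quiet drama about family and memory.",
    "A heist caper set aboard a luxury train.",
]

doc_vecs = model.encode(docs)                     # shape: (3, 384)
query_vec = model.encode("galactic rescue mission")

# Cosine similarity between the query and each document.
sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
for i in np.argsort(sims)[::-1]:
    print(f"{sims[i]:.3f}  {docs[i]}")            # rescue doc should rank first
```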
- Python 3.13+
- Install dependencies (editable install is fine): `pip install -e .`

Notes:
- The first semantic run will download the sentence-transformer model.
- NLTK may download resources on first use depending on your environment.
- Ensure `data/movies.json` exists (included in the repo). Both CLIs read this file.
- Caches are written to `cache/` automatically.
Script: `cli/keyword_search_cli.py`
Common commands:
- Build the inverted index (required once before searching):
  `python cli/keyword_search_cli.py build`
- BM25 search, top 5 results:
  `python cli/keyword_search_cli.py bm25search "space adventure" 5`
- Simple keyword search (first 5 docs that match any query token):
  `python cli/keyword_search_cli.py search "space adventure"`
- Inspect TF for a term in a document:
  `python cli/keyword_search_cli.py tf 12 "space"`
- Inspect IDF and BM25 IDF of a term:
  `python cli/keyword_search_cli.py idf "space"`
  `python cli/keyword_search_cli.py bm25idf "space"`
- Inspect BM25 TF (with optional k1, b):
  `python cli/keyword_search_cli.py bm25tf 12 "space" 1.2 0.75`
- Compute TF-IDF for a term in a document:
  `python cli/keyword_search_cli.py tfidf 12 "space"`

Document IDs correspond to the `id` field in `data/movies.json`.
Script: `cli/semantic_search_cli.py`
Common commands:
- Verify the model loads:
  `python cli/semantic_search_cli.py verify`
- Create or load embeddings and verify their shape against the documents:
  `python cli/semantic_search_cli.py verify_embeddings`
- Embed a text/query (quick peek at the vector):
  `python cli/semantic_search_cli.py embed_text "galactic rescue mission"`
  `python cli/semantic_search_cli.py embedquery "galactic rescue mission"`
- Run semantic search (top-k with cosine similarity):
  `python cli/semantic_search_cli.py search "galactic rescue mission" --limit 5`

This prints ranked results with their similarity scores and descriptions.
- Keyword search (`cli/lib/keyword_search.py`)
  - Preprocesses text, builds an inverted index, stores term frequencies and document lengths
  - BM25 scoring via `get_bm25_tf` and `get_bm25_idf`; combined in `bm25` and used in `bm25_search`
  - Index and stats cached in `cache/*.pkl` (the caching pattern is sketched after this list)
- Semantic search (`cli/lib/semantic_search.py`)
  - Uses `SentenceTransformer('all-MiniLM-L6-v2')` to embed documents and queries
  - Embeddings cached to `cache/movie_embeddings.npy`
  - Similarities computed with cosine similarity; top-k results returned
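The caching behaviour described above can be sketched roughly as follows; the cache paths mirror the repo's layout, but the helper functions and the `index.pkl` filename are illustrative, not the repo's actual API:

```python
import os
import pickle

import numpy as np

CACHE_DIR = "cache"

def save_index(index: dict, stats: dict) -> None:
    # Keyword side: pickle the inverted index and per-document stats.
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(os.path.join(CACHE_DIR, "index.pkl"), "wb") as f:
        pickle.dump({"index": index, "stats": stats}, f)

def load_or_build_embeddings(docs: list[str], encode) -> np.ndarray:
    # Semantic side: reuse cached embeddings only if they match the dataset size.
    path = os.path.join(CACHE_DIR, "movie_embeddings.npy")
    if os.path.exists(path):
        embeddings = np.load(path)
        if embeddings.shape[0] == len(docs):
            return embeddings
    embeddings = encode(docs)        # e.g. model.encode(docs)
    os.makedirs(CACHE_DIR, exist_ok=True)
    np.save(path, embeddings)
    return embeddings
```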
- "Cache directory not found" or "Run 'build' command first" (keyword search)
  - Run: `python cli/keyword_search_cli.py build`
- Embeddings shape mismatch (semantic search)
  - The CLI will auto-rebuild if the `movie_embeddings.npy` shape doesn't match the dataset length
  - If issues persist, delete `cache/movie_embeddings.npy` and rerun `verify_embeddings`
- Slow first run (semantic search)
  - The model download happens once; subsequent runs will be faster
- Tokenization & preprocessing
  - Normalize case, strip punctuation, (optionally) remove stopwords and apply stemming/lemmatization.
  - This repo applies consistent preprocessing for both indexing and querying to align tokens (a rough pipeline is sketched below).
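One way such a pipeline can look, using NLTK (which this repo depends on); the exact steps and resources the repo uses may differ:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time resource download (the repo may handle this differently).
nltk.download("stopwords", quiet=True)

_stop = set(stopwords.words("english"))
_stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    # Lowercase, strip punctuation, drop stopwords, stem what remains.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [_stemmer.stem(tok) for tok in text.split() if tok not in _stop]

print(preprocess("A Galactic Rescue Mission, against all odds!"))
# e.g. ['galact', 'rescu', 'mission', 'odd']
```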
- Term Frequency (TF)
  - Raw TF in this repo: the count of a token in a document.
  - Variants you may see elsewhere (compared in the snippet below):
    - Boolean TF: 0/1 depending on whether the term appears.
    - Log TF: `1 + ln(tf)` to dampen very high counts.
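The three variants side by side (illustrative helper functions, not repo code):

```python
import math

def raw_tf(count: int) -> int:
    return count                              # what this repo uses

def boolean_tf(count: int) -> int:
    return 1 if count > 0 else 0

def log_tf(count: int) -> float:
    return 1 + math.log(count) if count > 0 else 0.0

for c in (1, 5, 50):
    print(c, boolean_tf(c), round(log_tf(c), 2))
# 1  1 1.0
# 5  1 2.61
# 50 1 4.91
```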
- Document Frequency (DF) and Inverse Document Frequency (IDF)
  - DF: number of documents containing the term.
  - IDF in this repo (keyword search): `idf(t) = ln((N + 1) / (df(t) + 1))`
  - Intuition: rare terms (small `df`) get larger IDF → more discriminative power.
- TF-IDF
  - A simple relevance score: `tfidf(d, t) = tf(d, t) × idf(t)` (worked example below)
  - Good for quick ranking, but can overvalue long documents without length normalization.
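A small worked example of DF, IDF, and TF-IDF using the formulas above (the corpus counts are made up):

```python
import math

N = 100                                           # documents in the corpus
df = {"the": 98, "space": 12, "wormhole": 2}      # document frequency per term
tf_in_doc = {"the": 7, "space": 3, "wormhole": 1} # term frequency in one document

for term in df:
    idf = math.log((N + 1) / (df[term] + 1))
    tfidf = tf_in_doc[term] * idf
    print(f"{term:10s} idf={idf:.2f} tf-idf={tfidf:.2f}")

# the        idf=0.02 tf-idf=0.14   <- very common, contributes almost nothing
# space      idf=2.05 tf-idf=6.15
# wormhole   idf=3.52 tf-idf=3.52   <- rare, strongly weighted even with tf=1
```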
- BM25 (Okapi BM25)
  - Purpose: improve TF-IDF by saturating TF gains and normalizing for document length.
  - In this repo (combined in the snippet after this list):
    - Length normalization: `length_norm = 1 - b + b × (|d| / avgdl)`
    - Saturated TF: `bm25_tf = (tf × (k1 + 1)) / (tf + k1 × length_norm)`
    - BM25 IDF: `bm25_idf(t) = ln((N - df + 0.5) / (df + 0.5) + 1)`
    - Final per-term score: `bm25(d, t) = bm25_tf × bm25_idf(t)`
  - Tunables:
    - `k1` (default here via utils): controls TF saturation (~1.2–2.0 typical)
    - `b` (default here via utils): degree of length normalization (0 = no normalization, 1 = full)
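Putting those four formulas together in plain Python; this mirrors the equations above rather than the repo's actual code, and the `k1`/`b` values are the common defaults:

```python
import math

def bm25_term_score(tf: int, df: int, doc_len: int, avgdl: float,
                    n_docs: int, k1: float = 1.2, b: float = 0.75) -> float:
    length_norm = 1 - b + b * (doc_len / avgdl)
    bm25_tf = (tf * (k1 + 1)) / (tf + k1 * length_norm)
    bm25_idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return bm25_tf * bm25_idf

# A term appearing 3 times in a slightly long document, present in 12 of 100 docs.
print(round(bm25_term_score(tf=3, df=12, doc_len=120, avgdl=100, n_docs=100), 3))
# ≈ 3.148
```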
- Cosine similarity (semantic search)
  - Measures the angle between vectors (scale-invariant): `cos_sim(a, b) = (a · b) / (||a|| × ||b||)` (a small numpy helper follows below)
  - Range is [-1, 1]; in embedding spaces like `all-MiniLM-L6-v2`, relevant pairs typically show higher positive values.
  - Why cosine? It focuses on direction (semantics) rather than magnitude.
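The formula in a few lines of numpy (a generic helper, not necessarily the one the repo uses):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector lengths.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cos_sim(a, 10 * a))                         # ≈ 1.0, scale-invariant
print(cos_sim(a, np.array([-1.0, -2.0, -3.0])))   # ≈ -1.0, opposite direction
```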
- Embeddings
  - `SentenceTransformer('all-MiniLM-L6-v2')` encodes text into dense vectors (384 dimensions).
  - Semantically similar texts map to nearby vectors → higher cosine similarity.
  - This repo caches document embeddings to `cache/movie_embeddings.npy` and computes query embeddings on the fly.
- Normalization & scaling tips
  - Preprocessing consistency is critical: index-time and query-time preprocessing must match (illustrated below).
  - For keyword search, consider stopword handling: removing very common words can sharpen results.
  - For semantic search, you generally don't need to L2-normalize manually; many models and libraries handle suitable scaling internally, and cosine similarity already normalizes by vector length.
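A tiny illustration of why index-time and query-time preprocessing must match (made-up tokens):

```python
# Index built from stemmed tokens...
index = {"galact": {1}, "rescu": {1}, "mission": {1, 2}}

# ...but the query is tokenized without stemming: two of three tokens miss.
raw_query = ["galactic", "rescue", "mission"]
print([t for t in raw_query if t in index])      # ['mission']

# Stemming the query the same way the index was built restores the matches.
stemmed_query = ["galact", "rescu", "mission"]
print([t for t in stemmed_query if t in index])  # ['galact', 'rescu', 'mission']
```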
- Caveats
  - Keyword search can miss paraphrases and synonyms; semantic search can retrieve paraphrases but may be less transparent.
  - Embeddings are model-dependent; domain shift can hurt quality, so consider domain-specific models for best results.
- TF: count of a token in a document
- IDF: `ln((N + 1) / (df + 1))` (down-weights frequent terms)
- TF-IDF: `tf × idf` (simple relevance signal)
- BM25 (intuition):
  - Saturates TF gains, normalizes long docs, boosts rare terms
  - Tunables: `k1` (how fast TF saturates), `b` (how much length normalization)
- Embeddings + cosine similarity:
  - Convert text → vector; similar meanings → close vectors (high cosine)
Educational use. Adapt as needed for your experiments.