A production-ready information retrieval system built on the MS MARCO dataset, featuring multiple search algorithms including BM25, LambdaMART re-ranking, semantic search, and query expansion techniques.
This project implements a comprehensive document search engine that combines traditional IR techniques with modern machine learning approaches. Built for the MS MARCO (Microsoft MAchine Reading COmprehension) Document Ranking dataset containing 3.2 million documents, it provides a web interface to compare different retrieval and ranking methods.
- 5 Search Methods: BM25, LambdaMART re-ranking, Semantic Search, Query Expansion (RM3), Boolean Search
- Smart Autocomplete: Typo-tolerant fuzzy matching with "Did you mean?" suggestions
- ML-Powered Re-ranking: XGBoost-based LambdaMART model with 7 handcrafted features
- Semantic Understanding: Neural embeddings using sentence-transformers
- Production-Ready: Flask backend, responsive UI, sub-second response times
- Comprehensive Evaluation: Rigorous testing with MRR@10 and nDCG@10 metrics
Evaluated on 50 MS MARCO dev queries:
| Method | MRR@10 | nDCG@10 | vs Baseline |
|---|---|---|---|
| BM25 (Baseline) | 0.2319 | 0.3046 | - |
| LambdaMART | 0.4982 | 0.5513 | +115% |
| Semantic Search | 0.4122 | 0.4798 | +78% |
| RM3 (Query Expansion) | 0.0712 | 0.1188 | -69% * |
| Boolean Search | 0.2319 | 0.3046 | 0% ** |
* RM3 underperforms on MS MARCO because queries are already highly specific
** Boolean search identical to BM25 on natural language queries (no operators)

    ┌─────────────────────────────────────┐
    │   Frontend (HTML/CSS/JavaScript)    │
    │   • Real-time autocomplete          │
    │   • 5 search mode selector          │
    │   • Query correction UI             │
    └──────────────────┬──────────────────┘
                       │ HTTP/JSON
    ┌──────────────────▼──────────────────┐
    │   Backend (Flask Server)            │
    │   • 6 API endpoints                 │
    │   • Model integration               │
    │   • Feature extraction              │
    └──────────────────┬──────────────────┘
                       │
            ┌──────────┼──────────┐
            │          │          │
        ┌───▼────┐ ┌───▼────┐ ┌───▼─────┐
        │Pyserini│ │XGBoost │ │Semantic │
        │ BM25   │ │LambdaM │ │Transform│
        │ 3.2M   │ │7 feats │ │ 384-dim │
        │ docs   │ │        │ │         │
        └────────┘ └────────┘ └─────────┘

Backend:
- Python 3.8+
- Flask (Web server)
- Pyserini (BM25 search, Lucene integration)
- XGBoost (LambdaMART ranking)
- scikit-learn (TF-IDF, feature extraction)
- sentence-transformers (Semantic embeddings)
- thefuzz (Fuzzy string matching)
- pandas, numpy (Data processing)
Frontend:
- Vanilla HTML5/CSS3/JavaScript (No frameworks)
- Google Fonts (Inter, Roboto Mono)
- Font Awesome icons
Models & Data:
- MS MARCO v1 Document Ranking (3.2M documents)
- LambdaMART XGBoost model (446KB)
- TF-IDF Vectorizer (70MB)
- MiniLM-L6-v2 semantic model (80MB)
Prerequisites:
- Python 3.8 or higher
- 8GB+ RAM (for loading models and index)
- ~12GB disk space (for index and models)
Installation:
- Clone the repository

      git clone https://github.com/yourusername/marcode.git
      cd marcode

- Install dependencies

      pip install -r requirements.txt

- Download/Setup the Pyserini Index

  The MS MARCO index is large (~4GB). You have two options:

  Option A: Use Pre-built Index

      # Index will auto-download on first run
      # Or manually download from: https://git.uwaterloo.ca/jimmylin/anserini-indexes

  Option B: Build from Scratch

      # See Pyserini documentation for indexing MS MARCO documents

- Download Models (if not included)

  The repository includes:
  - `Model/lambdamart_reranker_final.json` - LambdaMART model
  - `Model/tfidf_vectorizer.pkl` - TF-IDF vectorizer

  The semantic model downloads automatically from HuggingFace on first use.

- Run the Application

      python app.py

  The server starts at http://localhost:5000

Usage:
- Open http://localhost:5000 in your browser
- Select a search method from the dropdown:
  - Standard (BM25)
  - Re-ranking (LambdaMART)
  - Semantic Search
  - BM25 + Query Expansion (RM3)
  - Boolean Search
- Type your query (autocomplete suggestions appear)
- Press Enter or click Search
- View ranked results with smart snippets
Natural Language:
- "what is machine learning"
- "how do vaccines work"
- "who is the current microsoft ceo"
Boolean (advanced):
- "machine AND learning NOT vision"
- "python AND (programming OR coding)"
- "vaccine AND (efficacy OR effectiveness)"
Search

    POST /search
    Content-Type: application/json

    {
      "query": "machine learning",
      "method": "lambdamart"
    }

Supported values for "method": bm25, lambdamart, semantic, bm25_rm3, boolean

Autocomplete

    GET /suggestions?q=machne

Search response

    {
      "results": [
        {
          "rank": 1,
          "doc_id": "D881266",
          "title": "What is Machine Learning?",
          "url": "https://...",
          "snippet": "Machine learning is a subset of AI...",
          "score": 0.9543,
          "bm25_score": 15.23
        }
      ],
      "correction": "machine"
    }

The "correction" field is null if no correction applies.

### 1. BM25
What: Probabilistic ranking function, industry standard for keyword matching
How: Scores documents using term frequency and document length normalization
Best for: Exact keyword searches
Latency: ~100ms
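
To make the scoring idea above concrete, here is a minimal sketch of the classic Okapi BM25 formula in plain Python. It is an illustration only, not the Lucene implementation that Pyserini runs; the function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=0.9` and `b=0.4` are assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each document against the query with Okapi BM25 (toy tokenizer)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more occurrences to score well.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "machine learning is a subset of artificial intelligence",
    "vaccines train the immune system",
    "learning to cook at home",
]
scores = bm25_scores("machine learning", docs)
```

The first document matches both query terms and scores highest, the third matches only "learning", and the second matches nothing and scores zero.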
### 2. LambdaMART Re-ranking
What: Machine learning model that re-ranks BM25 results using multiple features
Features (7 total):
- BM25 score
- Document length
- Query length
- TF-IDF cosine similarity
- Query-document term overlap
- Title match count
- IDF sum of matching terms
Best for: Queries similar to training data
Latency: ~300ms (includes feature extraction)
Training: 128 MS MARCO queries with relevance labels
### 3. Semantic Search
What: Neural embedding-based retrieval using sentence-transformers
How: Encodes query and documents into 384-dim vectors, ranks by cosine similarity
Model: all-MiniLM-L6-v2 (80MB, pre-trained on 1B+ sentence pairs)
Best for: Conceptual/semantic queries
Latency: ~500ms
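
The ranking step described above can be sketched without any model at all. The example below ranks documents by cosine similarity using toy 3-dimensional vectors in place of the real 384-dimensional embeddings; the helper names `cosine` and `rank_by_cosine` are hypothetical and not taken from `search_utils.py`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, similarity) pairs, best match first."""
    sims = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real sentence embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
ranking = rank_by_cosine(query_vec, doc_vecs)
```

In the application the vectors would presumably come from the sentence-transformers model, for example via `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`; that wiring is an assumption about how the project uses the library and is not shown here.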
### 4. Query Expansion (RM3)
What: Pseudo-relevance feedback that expands queries with related terms
Algorithm:
- First pass BM25 search
- Extract top terms from top 10 results using TF-IDF
- Append terms to query
- Second pass search
Best for: Ambiguous, short queries
Limitation: Performs poorly on MS MARCO's specific queries
Latency: ~250ms (two searches)
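
The expansion step (steps 2 and 3 of the algorithm above) might look roughly like the sketch below. It follows the TF-IDF term selection described in this section rather than the full RM3 relevance-model estimate, and the function name `expand_query`, the IDF smoothing, and the toy corpus are all assumptions for illustration.

```python
import math
from collections import Counter

def expand_query(query, feedback_docs, corpus_docs, n_terms=3):
    """Append the highest TF-IDF terms from the feedback documents to the query."""
    query_terms = set(query.lower().split())
    corpus = [set(d.lower().split()) for d in corpus_docs]
    # Term frequency pooled over the feedback (top-ranked) documents.
    tf = Counter(t for d in feedback_docs for t in d.lower().split())
    weights = {}
    for term, freq in tf.items():
        if term in query_terms:
            continue  # do not re-add terms already in the query
        df = sum(1 for d in corpus if term in d)
        weights[term] = freq * math.log((1 + len(corpus)) / (1 + df))
    # Sort by weight, breaking ties alphabetically for a stable result.
    top = sorted(weights, key=lambda t: (-weights[t], t))[:n_terms]
    return query + " " + " ".join(top) if top else query

corpus = [
    "python programming language tutorial",
    "python snake habitat",
    "java programming language",
    "snake venom facts",
]
# Pretend the first-pass BM25 search returned the first two documents.
expanded = expand_query("python", corpus[:2], corpus, n_terms=2)
```

Here the ambiguous query "python" picks up one term from each sense of the word, which hints at how expansion can introduce topic drift when the feedback documents are not all on target.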
### 5. Boolean Search
What: Structured queries with AND/OR/NOT operators
How: Direct Lucene query syntax
Example: machine AND learning NOT vision
Best for: Expert users needing precise control
Latency: ~100ms

    marcode/
    ├── app.py                              # Flask backend server
    ├── search_utils.py                     # Core search logic (17 functions)
    ├── LambdaMART.ipynb                    # Model training notebook
    │
    ├── templates/
    │   └── index.html                      # Main UI
    ├── static/
    │   ├── style.css                       # Styling
    │   └── script.js                       # Frontend logic
    │
    ├── Model/
    │   ├── lambdamart_reranker_final.json  # XGBoost model
    │   └── tfidf_vectorizer.pkl            # Feature extractor
    │
    ├── Dataset/
    │   ├── queries.docdev.tsv              # Dev queries (5,193)
    │   ├── msmarco-docdev-qrels.tsv        # Ground truth labels
    │   └── ltr_features_full.csv           # Extracted features
    │
    ├── PyseriniIndex/                      # Lucene index (3.2M docs)
    │
    ├── evaluate.py                         # Main evaluation script
    ├── evaluate_all_systems.py             # All methods comparison
    ├── evaluate_full.py                    # Full dataset evaluation
    │
    └── tests/
        ├── test_all_search.py
        ├── test_semantic.py
        ├── test_fuzzy_suggestions.py
        └── ... (15+ test scripts)

Core Application:
- `app.py` (113 lines) - Flask server with 6 endpoints
- `search_utils.py` (747 lines) - 17 search/utility functions
- `templates/index.html` - Single-page application UI
- `static/script.js` - Client-side search logic
- `static/style.css` - Professional dark theme styling

Machine Learning:
- `LambdaMART.ipynb` - Model training pipeline
- `Model/lambdamart_reranker_final.json` - Trained XGBoost model
- `Model/tfidf_vectorizer.pkl` - Pre-fitted TF-IDF vectorizer

Evaluation:
- `evaluate.py` - Main evaluation framework
- `evaluate_all_systems.py` - Compare all 5 search methods
- `evaluate_full.py` - Full dataset evaluation (128 queries)
- `evaluate_friend.py` - End-to-end integration test

Testing:
- `test_*.py` - 15+ unit/integration tests
- `debug_*.py` - Debugging utilities
Evaluate All Systems:

    python evaluate_all_systems.py

Output:

    FINAL RESULTS (MRR@10 | nDCG@10)
    =====================================
    1. BM25:        0.2319 | 0.3046
    2. LambdaMART:  0.4982 | 0.5513
    3. Semantic:    0.4122 | 0.4798
    4. RM3:         0.0712 | 0.1188
    5. Boolean:     0.2319 | 0.3046

Individual Search Method:

    # Test semantic search
    python test_semantic.py
    # Test fuzzy suggestions
    python test_fuzzy_suggestions.py
    # Test all search methods
    python test_all_search.py

MRR@10 (Mean Reciprocal Rank)
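- Measures: Position of the first relevant result in the top 10, averaged over all queries
- Formula: mean over queries of 1 / rank_of_first_relevant (0 if no relevant result appears in the top 10)
- Range: 0.0 (worst) to 1.0 (best)
- Example: First relevant at rank 2 → reciprocal rank = 0.5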
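
As a minimal sketch of the metric just described, the snippet below computes MRR@10 over a handful of made-up queries. The data layout (`runs` mapping query IDs to ranked document IDs, `qrels` mapping query IDs to sets of relevant IDs) is an assumption for the example and may differ from what `evaluate.py` uses.

```python
def reciprocal_rank(ranked_doc_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document in the top k, or 0 if none."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr_at_k(runs, qrels, k=10):
    """Average the per-query reciprocal ranks."""
    rr = [reciprocal_rank(docs, qrels.get(qid, set()), k) for qid, docs in runs.items()]
    return sum(rr) / len(rr) if rr else 0.0

# Made-up rankings: q1 hits at rank 2, q2 at rank 1, q3 misses entirely.
runs = {"q1": ["D3", "D1", "D7"], "q2": ["D9", "D4"], "q3": ["D5", "D2"]}
qrels = {"q1": {"D1"}, "q2": {"D9"}, "q3": {"D8"}}
score = mrr_at_k(runs, qrels)
```

The three reciprocal ranks are 0.5, 1.0 and 0.0, so the mean is 0.5.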
nDCG@10 (Normalized Discounted Cumulative Gain)
- Measures: Quality of the whole top-10 ranking (every position counts, with a logarithmic discount for lower ranks)
- Accounts for: Graded relevance (0, 1, 2, 3...)
- Range: 0.0 (worst) to 1.0 (perfect)
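
A compact sketch of nDCG@10 follows, assuming the linear-gain form of DCG (gain equal to the relevance label, discount log2(rank + 1)). Some evaluation tools use an exponential gain instead, so scores from this sketch may not match the project's scripts exactly; the function names are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """relevance maps doc id -> graded label; unlisted documents count as 0."""
    gains = [relevance.get(d, 0) for d in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# One relevant document found at rank 2.
binary = ndcg_at_k(["D3", "D1"], {"D1": 1})
# Two graded documents returned in the wrong order.
graded = ndcg_at_k(["B", "A"], {"A": 3, "B": 1})
```

A perfect ordering gives 1.0, and swapping a highly relevant document below a weakly relevant one lowers the score without dropping it to zero.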
MS MARCO v1 Document Ranking
- Documents: 3.2 million web pages
- Dev Queries: 5,193 questions
- Qrels: Binary relevance judgments (a document is either labeled relevant or left unjudged)
- Domain: Real web search queries
- Format: Natural language questions
### Smart Autocomplete
Technology: Hybrid fuzzy matching
- Strict Mode: Fast prefix matching (for exact starts)
- Fuzzy Mode: Levenshtein distance (for typos)
- Threshold: Dynamic (70-80 based on query length)
Examples:
- microsft → suggests microsoft
- machne learning → suggests machine learning
- hwo to → suggests how to
Performance: <10ms for 5,193 query corpus
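
The two-mode lookup described above can be approximated with the standard library alone. This sketch uses `difflib.SequenceMatcher` as a stand-in for the Levenshtein scoring that thefuzz provides, and the threshold rule (80 for inputs of up to 5 characters, 70 otherwise) is an assumed reading of "dynamic"; the real function in `search_utils.py` may differ.

```python
from difflib import SequenceMatcher

def suggest(prefix, corpus, limit=5):
    """Strict prefix matches first; fall back to fuzzy matching for typos."""
    text = prefix.lower().strip()
    strict = [q for q in corpus if q.startswith(text)]
    if strict:
        return strict[:limit]
    # Assumed rule: stricter threshold for short inputs, looser for longer ones.
    threshold = 80 if len(text) <= 5 else 70
    scored = []
    for q in corpus:
        # Compare against the same-length start of the candidate,
        # since the user is still typing.
        score = 100 * SequenceMatcher(None, text, q[:len(text)]).ratio()
        if score >= threshold:
            scored.append((score, q))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [q for _, q in scored[:limit]]

corpus = ["microsoft ceo", "machine learning", "how to tie a tie"]
by_prefix = suggest("mach", corpus)
by_typo = suggest("microsft", corpus)
```

Trying the strict pass first keeps the common case cheap; the fuzzy pass only runs when no candidate starts with the typed text.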
### Query Correction ("Did you mean?")
Trigger: Search returns fewer than 5 results, or the user enables correction manually
Logic:
- Hybrid scoring: 80% word-level, 20% partial ratio
- Threshold: 85% similarity
- Avoids: Substring matches (prevents "micro" → "microsoft")
UI: Yellow banner with clickable suggestion
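
One way to read the hybrid scoring above is sketched below. `difflib.SequenceMatcher` again stands in for thefuzz, and the definitions of the word-level score (average best match per query word) and the partial ratio (best same-length window) are assumptions; only the 80/20 weighting, the 85 threshold and the substring rule come from this section.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def word_level(query, cand):
    """Average, over query words, of the best match among candidate words."""
    q_words, c_words = query.split(), cand.split()
    if not q_words or not c_words:
        return 0.0
    return sum(max(ratio(q, c) for c in c_words) for q in q_words) / len(q_words)

def partial_ratio(a, b):
    """Best score of the shorter string against any same-length window of the longer."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    windows = range(len(long_) - len(short) + 1)
    return max(ratio(short, long_[i:i + len(short)]) for i in windows)

def correct(query, known_queries, threshold=85):
    """Return the best correction, or None when nothing is similar enough."""
    best, best_score = None, 0.0
    for cand in known_queries:
        if cand == query or query in cand:
            continue  # exact and substring matches are not treated as typos
        score = 0.8 * word_level(query, cand) + 0.2 * partial_ratio(query, cand)
        if score >= threshold and score > best_score:
            best, best_score = cand, score
    return best

known = ["microsoft", "machine learning"]
fixed = correct("microsft", known)
```

Skipping substring matches is what keeps an incomplete but correctly spelled query such as "micro" from being "corrected" to "microsoft".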
### Smart Snippets
Algorithm:
- Split document into sentences
- Score each sentence by query term overlap
- Return sentence with highest match
- Fallback: First 40 words if no match
Example:
- Query: machine learning algorithms
- Best Snippet: "Common algorithms include decision trees, neural networks..."
Performance: <1ms per document
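
The snippet algorithm above is short enough to sketch in full. The sentence splitter, the tokenizer and the name `best_snippet` are assumptions made for this example; only the overall procedure (score sentences by query term overlap, fall back to the first 40 words) is taken from the description.

```python
import re

def best_snippet(document, query, fallback_words=40):
    """Return the sentence sharing the most terms with the query."""
    query_terms = set(query.lower().split())
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    best, best_overlap = None, 0
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        overlap = len(query_terms & words)
        if overlap > best_overlap:  # on ties, the earlier sentence wins
            best, best_overlap = sentence, overlap
    if best is None:
        # No sentence shares a term with the query: use the opening words instead.
        return " ".join(document.split()[:fallback_words])
    return best

document = (
    "Machine learning is a field of AI. "
    "Common algorithms include decision trees and neural networks. "
    "It is widely used."
)
snippet = best_snippet(document, "decision trees")
```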
Training Data:
- 128 queries with BM25 candidate pools
- ~128,000 query-document pairs
- Relevance labels from MS MARCO qrels
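Model Parameters:

    from xgboost import XGBRanker

    # LambdaMART-style ranker: gradient-boosted trees trained to maximize nDCG
    model = XGBRanker(
        objective='rank:ndcg',   # LambdaMART objective optimizing nDCG
        tree_method='hist',      # histogram-based split finding
        eta=0.05,                # learning rate
        max_depth=6,
        n_estimators=100,
        eval_metric='ndcg@10',
    )

Feature Engineering: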
- F1: BM25 score (from Pyserini)
- F2: Document length (token count)
- F3: Query length (token count)
- F4: TF-IDF cosine similarity (query-doc)
- F5: Term overlap ratio (|query ∩ doc| / |query|)
- F6: Title match count
- F7: IDF sum of matching terms
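
For illustration, most of the features listed above can be computed with a few lines of plain Python. This is a sketch under stated assumptions: the function name `ltr_features`, the tokenizer and the IDF formula are hypothetical, the BM25 score (F1) is passed in rather than computed, and F4 is omitted because it needs the fitted TF-IDF vectorizer.

```python
import math
import re

def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())

def ltr_features(query, title, body, bm25_score, doc_freq, n_docs):
    """Build a feature dict for one query-document pair (F4 not included)."""
    q_terms = set(tokens(query))
    d_terms = tokens(body)
    title_terms = set(tokens(title))
    matched = q_terms & set(d_terms)
    return {
        "bm25": bm25_score,                                        # F1
        "doc_len": len(d_terms),                                   # F2
        "query_len": len(tokens(query)),                           # F3
        "overlap": len(matched) / len(q_terms) if q_terms else 0,  # F5
        "title_matches": len(q_terms & title_terms),               # F6
        # F7: assumed smoothed IDF, log(N / (1 + df)), summed over matched terms.
        "idf_sum": sum(math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in matched),
    }

features = ltr_features(
    query="machine learning basics",
    title="What is Machine Learning?",
    body="Machine learning is a subset of AI",
    bm25_score=15.23,
    doc_freq={"machine": 9, "learning": 99},  # made-up document frequencies
    n_docs=1000,
)
```

Each candidate returned by BM25 would get one such feature vector, and the ranker then scores the vectors to produce the final order.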
Training Time: ~2 minutes on CPU
Model Size: ~446KB
Inference: <50ms for 100 documents
Caching:
- TF-IDF vectorizer pre-fitted on dev queries
- Semantic model loaded once at startup
- Index kept in memory (3.2M docs)
Batching:
- Feature extraction batched by query
- Semantic encoding batched (100 docs at a time)
Latency Breakdown:
| Component | Time |
|---|---|
| BM25 Search | 80-120ms |
| Feature Extraction | 150-200ms |
| XGBoost Inference | 30-50ms |
| Snippet Generation | 10-20ms |
| Total (LambdaMART) | ~300ms |
Issue: Query expansion degrades quality on MS MARCO (-69% MRR)
Reason: MS MARCO queries are too specific ("what is botulinum toxin definition")
Solution: Use RM3 selectively for short/ambiguous queries
Status: Working as designed (dataset mismatch)

Issue: Model performs excellently on training distribution but struggles on novel queries
Reason: Only 128 training queries
Solution: Retrain on full dataset (5,193 queries)
Workaround: Falls back to BM25 for out-of-distribution queries

Issue: PyseriniIndex is ~4GB
Reason: Full MS MARCO corpus (3.2M docs)
Workaround: Use pre-built index from Anserini

Issue: First query takes 5-10s
Reason: Semantic model download + index loading
Solution: Models cache after first use

Contributions welcome! Areas for improvement:
- More Training Data: Expand LambdaMART training to full 5K queries
- Neural Re-ranking: Replace LambdaMART with BERT/T5 cross-encoder
- Query Understanding: Add NER, intent classification
- UI Enhancements: Result highlighting, filtering, pagination
- Deployment: Docker containerization, production WSGI server
Process:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
MS MARCO Dataset:

    @article{DBLP:journals/corr/abs-1611-09268,
      title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
      author={Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li},
      journal={arXiv preprint arXiv:1611.09268},
      year={2016}
    }

Pyserini:

    @inproceedings{Lin_etal_SIGIR2021_Pyserini,
      title={Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations},
      author={Lin, Jimmy and Ma, Xueguang and Lin, Sheng-Chieh and Yang, Jheng-Hong and Pradeep, Ronak and Nogueira, Rodrigo},
      booktitle={SIGIR},
      year={2021}
    }

LambdaMART:

    @techreport{burges2010ranknet,
      title={From RankNet to LambdaRank to LambdaMART: An overview},
      author={Burges, Christopher JC},
      institution={Microsoft Research},
      number={MSR-TR-2010-82},
      year={2010}
    }

This project is licensed under the MIT License - see LICENSE file for details.
Your Name
- GitHub: @yourusername
- Email: your.email@example.com
- MS MARCO Team for the dataset
- Pyserini/Anserini developers
- University of Waterloo (Jimmy Lin's lab)
- HuggingFace for pre-trained models
- Open source community
If you find this project useful, please consider giving it a star!


