DheerajRamKalava/marcode

Marcode: Advanced Document Search Engine

A production-ready information retrieval system built on the MS MARCO dataset, featuring multiple search algorithms including BM25, LambdaMART re-ranking, semantic search, and query expansion techniques.

[Screenshot: Search Engine UI]

🌟 Overview

This project implements a comprehensive document search engine that combines traditional IR techniques with modern machine learning approaches. Built for the MS MARCO (Microsoft MAchine Reading COmprehension) Document Ranking dataset containing 3.2 million documents, it provides a web interface to compare different retrieval and ranking methods.

Key Features

  • 5 Search Methods: BM25, LambdaMART re-ranking, Semantic Search, Query Expansion (RM3), Boolean Search
  • Smart Autocomplete: Typo-tolerant fuzzy matching with "Did you mean?" suggestions
  • ML-Powered Re-ranking: XGBoost-based LambdaMART model with 7 handcrafted features
  • Semantic Understanding: Neural embeddings using sentence-transformers
  • Production-Ready: Flask backend, responsive UI, sub-second response times
  • Comprehensive Evaluation: Rigorous testing with MRR@10 and nDCG@10 metrics

📊 Performance Results

Evaluated on 50 MS MARCO dev queries:

Method                | MRR@10 | nDCG@10 | vs Baseline (MRR@10)
BM25 (Baseline)       | 0.2319 | 0.3046  | -
LambdaMART            | 0.4982 | 0.5513  | +115% 🏆
Semantic Search       | 0.4122 | 0.4798  | +78%
RM3 (Query Expansion) | 0.0712 | 0.1188  | -69% *
Boolean Search        | 0.2319 | 0.3046  | 0% **

* RM3 underperforms on MS MARCO because queries are already highly specific
** Boolean search identical to BM25 on natural language queries (no operators)


🏗️ Architecture

┌─────────────────────────────────────┐
│   Frontend (HTML/CSS/JavaScript)    │
│   • Real-time autocomplete          │
│   • 5 search mode selector          │
│   • Query correction UI             │
└──────────────┬──────────────────────┘
               │ HTTP/JSON
┌──────────────▼──────────────────────┐
│      Backend (Flask Server)         │
│   • 6 API endpoints                 │
│   • Model integration               │
│   • Feature extraction              │
└──────────────┬──────────────────────┘
               │
    ┌──────────┼──────────┐
    │          │          │
┌───▼────┐ ┌───▼───┐ ┌────▼────┐
│Pyserini│ │XGBoost│ │Semantic │
│ BM25   │ │LambdaM│ │Transform│
│3.2M    │ │7 feats│ │ 384-dim │
│docs    │ │       │ │         │
└────────┘ └───────┘ └─────────┘

Technology Stack

Backend:

  • Python 3.8+
  • Flask (Web server)
  • Pyserini (BM25 search, Lucene integration)
  • XGBoost (LambdaMART ranking)
  • scikit-learn (TF-IDF, feature extraction)
  • sentence-transformers (Semantic embeddings)
  • thefuzz (Fuzzy string matching)
  • pandas, numpy (Data processing)

Frontend:

  • Vanilla HTML5/CSS3/JavaScript (No frameworks)
  • Google Fonts (Inter, Roboto Mono)
  • Font Awesome icons

Models & Data:

  • MS MARCO v1 Document Ranking (3.2M documents)
  • LambdaMART XGBoost model (446KB)
  • TF-IDF Vectorizer (70MB)
  • MiniLM-L6-v2 semantic model (80MB)

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • 8GB+ RAM (for loading models and index)
  • ~12GB disk space (for index and models)

Installation

  1. Clone the repository
git clone https://github.com/DheerajRamKalava/marcode.git
cd marcode
  2. Install dependencies
pip install -r requirements.txt
  3. Download or set up the Pyserini index

The MS MARCO index is large (~4GB). You have two options:

Option A: Use Pre-built Index

# Index will auto-download on first run
# Or manually download from: https://git.uwaterloo.ca/jimmylin/anserini-indexes

Option B: Build from Scratch

# See Pyserini documentation for indexing MS MARCO documents
  4. Download Models (if not included)

The repository includes:

  • Model/lambdamart_reranker_final.json - LambdaMART model
  • Model/tfidf_vectorizer.pkl - TF-IDF vectorizer

The semantic model downloads automatically from HuggingFace on first use.

  5. Run the Application
python app.py

The server starts at http://localhost:5000


💻 Usage

Web Interface

  1. Open http://localhost:5000 in your browser
  2. Select a search method from the dropdown:
    • Standard (BM25)
    • Re-ranking (LambdaMART)
    • Semantic Search
    • BM25 + Query Expansion (RM3)
    • Boolean Search
  3. Type your query (autocomplete suggestions appear)
  4. Press Enter or click Search
  5. View ranked results with smart snippets

Example Queries

Natural Language:

  • "what is machine learning"
  • "how do vaccines work"
  • "who is the current microsoft ceo"

Boolean (advanced):

  • "machine AND learning NOT vision"
  • "python AND (programming OR coding)"
  • "vaccine AND (efficacy OR effectiveness)"

API Endpoints

Search

POST /search
Content-Type: application/json

{
  "query": "machine learning",
  "method": "lambdamart"
}

Supported "method" values: bm25, lambdamart, semantic, bm25_rm3, boolean

Autocomplete

GET /suggestions?q=machne

Response (from /search)

{
  "results": [
    {
      "rank": 1,
      "doc_id": "D881266",
      "title": "What is Machine Learning?",
      "url": "https://...",
      "snippet": "Machine learning is a subset of AI...",
      "score": 0.9543,
      "bm25_score": 15.23
    }
  ],
  "correction": "machine"
}

The "correction" field is null when there is no correction to suggest.
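A minimal client sketch for the search endpoint, using only the Python standard library. The URL, request fields, and response fields come from the examples above; the helper names `build_search_request` and `search` are hypothetical, and the server is assumed to be running at the default address.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default address from the Quick Start section


def build_search_request(query, method="bm25", base_url=BASE_URL):
    """Build (but do not send) a POST /search request with a JSON body."""
    body = json.dumps({"query": query, "method": method}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, method="bm25", base_url=BASE_URL, timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_search_request(query, method, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `search("machine learning", method="lambdamart")` should return a dict with the `results` and `correction` keys described above.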

🔍 Search Methods Explained

1. BM25 (Baseline)

What: Probabilistic ranking function, industry standard for keyword matching
How: Scores documents using term frequency and document length normalization
Best for: Exact keyword searches
Latency: ~100ms
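To make the scoring idea concrete, here is a small pure-Python sketch of the textbook Okapi BM25 formula. It is not the project's code (which relies on Pyserini/Lucene, whose implementation may differ in details), and the `k1` and `b` values are assumptions, since the project's settings are not stated.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that contain more of the query terms, and rarer ones, score higher; documents with no query terms score zero.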

2. LambdaMART Re-ranking

What: Machine learning model that re-ranks BM25 results using multiple features
Features (7 total):

  1. BM25 score
  2. Document length
  3. Query length
  4. TF-IDF cosine similarity
  5. Query-document term overlap
  6. Title match count
  7. IDF sum of matching terms

Best for: Queries similar to training data
Latency: ~300ms (includes feature extraction)
Training: 128 MS MARCO queries with relevance labels

3. Semantic Search

What: Neural embedding-based retrieval using sentence-transformers
How: Encodes query and documents into 384-dim vectors, ranks by cosine similarity
Model: all-MiniLM-L6-v2 (80MB, pre-trained on 1B+ sentence pairs)
Best for: Conceptual/semantic queries
Latency: ~500ms
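The ranking step can be sketched without any ML dependencies. The example below uses toy 3-dimensional vectors as stand-ins for the 384-dimensional embeddings; in the real system the vectors would come from the sentence-transformers model. The function names are illustrative, not taken from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
toy_docs = {"D1": [0.9, 0.1, 0.0], "D2": [0.0, 1.0, 0.0], "D3": [0.7, 0.7, 0.0]}
ranking = rank_by_similarity([1.0, 0.0, 0.0], toy_docs)
```

Here `ranking` orders the documents by how closely their direction matches the query vector, regardless of vector length.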

4. Query Expansion (RM3)

What: Pseudo-relevance feedback that expands queries with related terms
Algorithm:

  1. First pass BM25 search
  2. Extract top terms from top 10 results using TF-IDF
  3. Append terms to query
  4. Second pass search

Best for: Ambiguous, short queries
Limitation: Performs poorly on MS MARCO's specific queries
Latency: ~250ms (two searches)
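The four steps listed above can be sketched as follows. This mirrors the simplified procedure described here rather than a full RM3 relevance model (which interpolates term weights with the original query). `search_fn` and `idf` are hypothetical inputs: a callable returning tokenized documents and a term-to-IDF mapping.

```python
from collections import Counter


def expansion_terms(query_terms, feedback_docs, idf, n_terms=5):
    """Return the `n_terms` highest TF-IDF terms that are not already in the query."""
    term_freq = Counter(term for doc in feedback_docs for term in doc)
    weights = {
        term: count * idf.get(term, 0.0)
        for term, count in term_freq.items()
        if term not in query_terms
    }
    # Highest weight first; ties broken alphabetically for determinism.
    return sorted(weights, key=lambda t: (-weights[t], t))[:n_terms]


def expand_and_search(query_terms, search_fn, idf, top_k=10, n_terms=5):
    """Run the two-pass search; `search_fn(terms, k)` returns tokenized documents."""
    first_pass = search_fn(query_terms, top_k)                      # step 1
    extra = expansion_terms(query_terms, first_pass, idf, n_terms)  # step 2
    expanded = list(query_terms) + extra                            # step 3
    return expanded, search_fn(expanded, top_k)                     # step 4
```

Because the expansion terms come from the first-pass results, a poor first pass can pull the query off topic, which is one way expansion can hurt already-specific queries.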

5. Boolean Search

What: Structured queries with AND/OR/NOT operators
How: Direct Lucene query syntax
Example: machine AND learning NOT vision
Best for: Expert users needing precise control
Latency: ~100ms
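For intuition, Boolean retrieval reduces to set operations over an inverted index. The sketch below is a toy illustration of `machine AND learning NOT vision`, not the Lucene query parser the project uses; it handles only a flat list of required and excluded terms, and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and_not(index, required, excluded=()):
    """Documents containing every `required` term and none of the `excluded` terms."""
    if not required:
        return set()
    # AND: intersect the posting sets of all required terms.
    result = set.intersection(*(index.get(term, set()) for term in required))
    # NOT: remove documents that contain any excluded term.
    for term in excluded:
        result -= index.get(term, set())
    return result
```

An OR clause would be a set union in the same style; parentheses require a real parser, which is what Lucene's query syntax provides.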


📁 Project Structure

marcode/
├── app.py                      # Flask backend server
├── search_utils.py             # Core search logic (17 functions)
├── LambdaMART.ipynb            # Model training notebook
│
├── templates/
│   └── index.html              # Main UI
├── static/
│   ├── style.css               # Styling
│   └── script.js               # Frontend logic
│
├── Model/
│   ├── lambdamart_reranker_final.json  # XGBoost model
│   └── tfidf_vectorizer.pkl            # Feature extractor
│
├── Dataset/
│   ├── queries.docdev.tsv              # Dev queries (5,193)
│   ├── msmarco-docdev-qrels.tsv        # Ground truth labels
│   └── ltr_features_full.csv           # Extracted features
│
├── PyseriniIndex/              # Lucene index (3.2M docs)
│
├── evaluate.py                 # Main evaluation script
├── evaluate_all_systems.py     # All methods comparison
├── evaluate_full.py            # Full dataset evaluation
│
└── tests/
    ├── test_all_search.py
    ├── test_semantic.py
    ├── test_fuzzy_suggestions.py
    └── ... (15+ test scripts)

Key Files

Core Application:

  • app.py (113 lines) - Flask server with 6 endpoints
  • search_utils.py (747 lines) - 17 search/utility functions
  • templates/index.html - Single-page application UI
  • static/script.js - Client-side search logic
  • static/style.css - Professional dark theme styling

Machine Learning:

  • LambdaMART.ipynb - Model training pipeline
  • Model/lambdamart_reranker_final.json - Trained XGBoost model
  • Model/tfidf_vectorizer.pkl - Pre-fitted TF-IDF vectorizer

Evaluation:

  • evaluate.py - Main evaluation framework
  • evaluate_all_systems.py - Compare all 5 search methods
  • evaluate_full.py - Full dataset evaluation (128 queries)
  • evaluate_friend.py - End-to-end integration test

Testing:

  • test_*.py - 15+ unit/integration tests
  • debug_*.py - Debugging utilities

🧪 Evaluation & Testing

Running Evaluations

Evaluate All Systems:

python evaluate_all_systems.py

Output:

FINAL RESULTS (MRR@10 | nDCG@10)
=====================================
1. BM25: 0.2319 | 0.3046
2. LambdaMART: 0.4982 | 0.5513
3. Semantic: 0.4122 | 0.4798
4. RM3: 0.0712 | 0.1188
5. Boolean: 0.2319 | 0.3046

Individual Search Method:

# Test semantic search
python test_semantic.py

# Test fuzzy suggestions
python test_fuzzy_suggestions.py

# Test all search methods
python test_all_search.py

Metrics Explained

MRR@10 (Mean Reciprocal Rank)

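  • Measures: Position of the first relevant result within the top 10
  • Formula: mean over all queries of 1 / rank_of_first_relevant (0 when no relevant result appears in the top 10)
  • Range: 0.0 (worst) to 1.0 (best)
  • Example: First relevant result at rank 2 → reciprocal rank = 0.5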
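A short sketch of how MRR@10 can be computed. The data layout is an assumption for illustration: `runs` maps each query ID to its ranked document IDs, and `qrels` maps each query ID to the set of relevant document IDs.

```python
def reciprocal_rank(ranked_doc_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document in the top k, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels, k=10):
    """Average the reciprocal rank over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(reciprocal_rank(docs, qrels.get(qid, set()), k) for qid, docs in runs.items())
    return total / len(runs)
```

A query whose first relevant document sits at rank 2 contributes 0.5; a query with no relevant document in the top 10 contributes 0.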

nDCG@10 (Normalized Discounted Cumulative Gain)

  • Measures: Quality of entire ranking (considers all positions)
  • Accounts for: Graded relevance (0, 1, 2, 3...)
  • Range: 0.0 (worst) to 1.0 (perfect)
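A compact nDCG@10 sketch, assuming the linear-gain variant (some implementations use 2^rel - 1 as the gain instead; with binary labels the two agree). `relevance` is a hypothetical mapping from document ID to its graded label.

```python
import math


def dcg(gains, k=10):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))


def ndcg(ranked_doc_ids, relevance, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg(ideal_gains, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Placing the only relevant document first yields 1.0; placing it second yields 1 / log2(3), roughly 0.63.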

Dataset

MS MARCO v1 Document Ranking

  • Documents: 3.2 million web pages
  • Dev Queries: 5,193 questions
  • Qrels: Sparse relevance judgments (the dev qrels mark relevant documents with a binary label)
  • Domain: Real web search queries
  • Format: Natural language questions

✨ Features Deep Dive

Smart Autocomplete

Technology: Hybrid fuzzy matching

  • Strict Mode: Fast prefix matching (for exact starts)
  • Fuzzy Mode: Levenshtein distance (for typos)
  • Threshold: Dynamic (70-80 based on query length)

Examples:

  • microsft → suggests microsoft
  • machne learning → suggests machine learning
  • hwo to → suggests how to

Performance: <10ms against the 5,193-query corpus
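A rough sketch of the strict-then-fuzzy idea, using the standard library's `difflib` as a stand-in for the edit-distance scoring that thefuzz provides. The function names and the rule mapping query length to a threshold of 70 or 80 are assumptions; the project's exact rule is not stated.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity on a 0-100 scale (difflib stand-in for a fuzzy ratio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def suggest(text, corpus, limit=5):
    """Return up to `limit` suggestions from `corpus` for the typed `text`."""
    text = text.lower()
    # Strict mode: cheap prefix matches are returned first.
    strict = [query for query in corpus if query.startswith(text)]
    if strict:
        return strict[:limit]
    # Fuzzy mode: assumed threshold of 70 for short inputs, 80 for longer ones.
    threshold = 70 if len(text) <= 6 else 80
    scored = [(similarity(text, query), query) for query in corpus]
    matches = [(score, query) for score, query in scored if score >= threshold]
    # Best score first; ties broken alphabetically.
    return [query for _, query in sorted(matches, key=lambda m: (-m[0], m[1]))][:limit]
```

Prefix matching handles the common case of a correctly typed start, and the fuzzy pass only runs when no prefix matches.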

"Did You Mean?" Correction

Trigger: The search returns fewer than 5 results, or the user enables correction
Logic:

  • Hybrid scoring: 80% word-level, 20% partial ratio
  • Threshold: 85% similarity
  • Avoids: Substring matches (prevents "micro" → "microsoft")

UI: Yellow banner with clickable suggestion
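One way the hybrid score could look, sketched with `difflib` in place of thefuzz. The 80/20 weighting, the 85 threshold, and the substring rule come from the list above; how the word-level score is computed is an assumption, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """String similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def partial_ratio(a, b):
    """Best score of the shorter string against any equal-length window of the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    windows = range(len(long_) - len(short) + 1)
    return max(ratio(short, long_[i:i + len(short)]) for i in windows)


def word_level_score(query, candidate):
    """Average, over query words, of the best match among the candidate's words."""
    q_words, c_words = query.split(), candidate.split()
    if not q_words or not c_words:
        return 0.0
    return sum(max(ratio(q, c) for c in c_words) for q in q_words) / len(q_words)


def did_you_mean(query, candidates, threshold=85.0):
    """Return the best-scoring candidate at or above `threshold`, or None."""
    best, best_score = None, threshold
    for candidate in candidates:
        if candidate == query or query in candidate or candidate in query:
            continue  # skip identical strings and pure substring matches
        score = 0.8 * word_level_score(query, candidate) + 0.2 * partial_ratio(query, candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best
```

The substring check is what keeps a valid short query such as "micro" from being "corrected" to "microsoft".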

Smart Snippets

Algorithm:

  1. Split document into sentences
  2. Score each sentence by query term overlap
  3. Return sentence with highest match
  4. Fallback: First 40 words if no match

Example:

  • Query: machine learning algorithms
  • Best Snippet: "Common algorithms include decision trees, neural networks..."

Performance: <1ms per document
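The snippet algorithm above can be sketched in a few lines. The sentence splitter here is a naive regular expression and the function name is hypothetical; the project's own tokenization may differ.

```python
import re


def best_snippet(document, query, fallback_words=40):
    """Return the sentence sharing the most terms with the query, or a leading excerpt."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    best_sentence, best_overlap = None, 0
    for sentence in sentences:
        sentence_terms = set(re.findall(r"\w+", sentence.lower()))
        overlap = len(query_terms & sentence_terms)
        if overlap > best_overlap:
            best_sentence, best_overlap = sentence, overlap
    if best_sentence is not None:
        return best_sentence
    # Fallback: the first `fallback_words` words of the document.
    return " ".join(document.split()[:fallback_words])
```

Ties go to the earliest sentence, since a later sentence replaces the current best only when it has strictly more overlapping terms.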


🎓 Technical Details

LambdaMART Training

Training Data:

  • 128 queries with BM25 candidate pools
  • ~128,000 query-document pairs
  • Relevance labels from MS MARCO qrels

Model Parameters:

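from xgboost import XGBRanker

model = XGBRanker(
    objective='rank:ndcg',   # LambdaMART objective that optimizes nDCG
    tree_method='hist',      # histogram-based tree construction
    eta=0.05,                # learning rate
    max_depth=6,             # maximum depth of each tree
    n_estimators=100,        # number of boosting rounds
    eval_metric='ndcg@10'    # metric reported on evaluation sets
)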

Feature Engineering:

  • F1: BM25 score (from Pyserini)
  • F2: Document length (token count)
  • F3: Query length (token count)
  • F4: TF-IDF cosine similarity (query-doc)
  • F5: Term overlap ratio (|query ∩ doc| / |query|)
  • F6: Title match count
  • F7: IDF sum of matching terms
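Most of these features need only simple token arithmetic. The sketch below computes F1-F3 and F5-F7 for one query-document pair; F4 is left out because it depends on the fitted TF-IDF vectorizer. The function name, the regex tokenizer, and the `idf` mapping are assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def ltr_features(query, title, body, bm25_score, idf):
    """Compute six of the seven features; `idf` is a hypothetical term -> IDF mapping."""
    q_terms, d_terms = tokenize(query), tokenize(body)
    q_set, d_set, t_set = set(q_terms), set(d_terms), set(tokenize(title))
    matched = q_set & d_set
    return {
        "bm25_score": bm25_score,                                      # F1
        "doc_length": len(d_terms),                                    # F2
        "query_length": len(q_terms),                                  # F3
        "term_overlap": len(matched) / len(q_set) if q_set else 0.0,   # F5
        "title_match_count": len(q_set & t_set),                       # F6
        "idf_sum": sum(idf.get(term, 0.0) for term in matched),        # F7
    }
```

One such feature dict per candidate document, in a fixed column order, forms the matrix that the ranker scores for each query.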

Training Time: ~2 minutes on CPU
Model Size: ~446KB
Inference: <50ms for 100 documents

Performance Optimization

Caching:

  • TF-IDF vectorizer pre-fitted on dev queries
  • Semantic model loaded once at startup
  • Index kept in memory (3.2M docs)

Batching:

  • Feature extraction batched by query
  • Semantic encoding batched (100 docs at a time)

Latency Breakdown:

Component          | Time
BM25 Search        | 80-120ms
Feature Extraction | 150-200ms
XGBoost Inference  | 30-50ms
Snippet Generation | 10-20ms
Total (LambdaMART) | ~300ms

🐛 Known Issues & Limitations

1. RM3 Underperformance

Issue: Query expansion degrades quality on MS MARCO (-69% MRR)
Reason: MS MARCO queries are already highly specific (e.g., "what is botulinum toxin definition"), so expansion terms tend to add noise
Solution: Use RM3 selectively for short/ambiguous queries
Status: Working as designed (dataset mismatch)

2. LambdaMART Overfitting

Issue: Model performs excellently on training distribution but struggles on novel queries
Reason: Only 128 training queries
Solution: Retrain on full dataset (5,193 queries)
Workaround: Falls back to BM25 for out-of-distribution queries

3. Index Size

Issue: PyseriniIndex is ~4GB
Reason: Full MS MARCO corpus (3.2M docs)
Workaround: Use pre-built index from Anserini

4. Cold Start Latency

Issue: First query takes 5-10s
Reason: Semantic model download + index loading
Solution: Models cache after first use


🤝 Contributing

Contributions welcome! Areas for improvement:

  1. More Training Data: Expand LambdaMART training to full 5K queries
  2. Neural Re-ranking: Replace LambdaMART with BERT/T5 cross-encoder
  3. Query Understanding: Add NER, intent classification
  4. UI Enhancements: Result highlighting, filtering, pagination
  5. Deployment: Docker containerization, production WSGI server

Process:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 Citations

MS MARCO Dataset:

@article{DBLP:journals/corr/abs-1611-09268,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li},
  journal={arXiv preprint arXiv:1611.09268},
  year={2016}
}

Pyserini:

@inproceedings{Lin_etal_SIGIR2021_Pyserini,
  title={Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations},
  author={Lin, Jimmy and Ma, Xueguang and Lin, Sheng-Chieh and Yang, Jheng-Hong and Pradeep, Ronak and Nogueira, Rodrigo},
  booktitle={SIGIR},
  year={2021}
}

LambdaMART:

@techreport{burges2010ranknet,
  title={From RankNet to LambdaRank to LambdaMART: An Overview},
  author={Burges, Christopher J. C.},
  institution={Microsoft Research},
  number={MSR-TR-2010-82},
  year={2010}
}

📜 License

This project is licensed under the MIT License - see LICENSE file for details.


👥 Authors

Your Name


🙏 Acknowledgments

  • MS MARCO Team for the dataset
  • Pyserini/Anserini developers
  • University of Waterloo (Jimmy Lin's lab)
  • HuggingFace for pre-trained models
  • Open source community

📸 Screenshots

[Screenshots: Main Interface, Search Results, Query Suggestions (autocomplete)]


⭐ If you find this project useful, please consider giving it a star!
