This repository contains scripts, results, and documentation for RankSaga's comprehensive BEIR (Benchmarking Information Retrieval) benchmarking study. We evaluated embedding model optimization techniques and achieved significant gains, including up to 51% improvement on medical information retrieval tasks.
Our optimized models show substantial improvements across multiple BEIR datasets:
- NFCorpus (Medical): +15-51% improvement across all metrics
- SciDocs (Scientific): +3-20% improvement
- Quora (General): Maintained high baseline performance
- Maximum Gain: 51% improvement on Recall@100 for medical information retrieval
📊 View Detailed Results and Visualizations
The fine-tuned model is available on Hugging Face:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RankSaga/ranksaga-optimized-e5-v2")
embeddings = model.encode(["Your text here"])
```

```bash
# Install dependencies
pip install -r requirements.txt
# Run baseline evaluation
python run_baseline.py
# Fine-tune model
python fine_tune_model.py
# Evaluate optimized model
python run_optimized.py
# Compare results
python compare_results.py
```

This benchmarking suite allows you to:
- Baseline Evaluation: Benchmark pre-trained models (e.g., intfloat/e5-base-v2) on BEIR datasets
- Model Fine-Tuning: Apply RankSaga optimization techniques to improve model performance
- Optimized Evaluation: Benchmark fine-tuned models to measure improvements
- Results Comparison: Compare baseline vs optimized performance with visualizations
- Reproduce Results: All code, data, and configurations are provided
Running on Modal.com is best for fast execution: no local GPU is needed and billing is pay-per-use.
See MODAL_SETUP.md for complete instructions.
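For orientation, a minimal modal_app.py could look roughly like the sketch below. This is hypothetical: the image contents, function names, and entrypoint are assumptions rather than the repository's actual code; only the A10G GPU choice is taken from the methodology section of this README.

```python
# Hypothetical sketch of a Modal app for this benchmark; the real modal_app.py
# in the repository may be organized differently.
import modal

app = modal.App("beir-benchmarking")

# Container image with the dependencies the scripts need (assumed package list).
image = modal.Image.debian_slim().pip_install("sentence-transformers", "beir", "torch")

@app.function(image=image, gpu="A10G", timeout=60 * 60)
def run_baseline():
    # Placeholder: the real function would run the same logic as run_baseline.py.
    print("Running baseline BEIR evaluation on Modal...")

@app.local_entrypoint()
def main():
    # `modal run modal_app.py` invokes this entrypoint and executes the
    # function remotely on the configured GPU.
    run_baseline.remote()
```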
Quick start:
```bash
pip install modal
modal token new
modal deploy modal_app.py
modal run modal_app.py
```

To run locally instead:

- Python 3.8 or higher
- CUDA-capable GPU (recommended for faster training and evaluation)
- Create a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Edit config.py to customize:
- Base model to fine-tune
- BEIR datasets to benchmark
- Training hyperparameters (epochs, batch size, learning rate)
- Output paths
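For illustration, config.py might look roughly like the sketch below. BASE_MODEL, BEIR_DATASETS, and TRAINING_CONFIG appear elsewhere in this README; the output-path variable names are assumptions.

```python
# Sketch of config.py based on the options described above; the actual file
# may use different names, especially for the output-path settings.
BASE_MODEL = "intfloat/e5-base-v2"

BEIR_DATASETS = ["scifact", "nfcorpus", "scidocs", "quora"]

TRAINING_CONFIG = {
    "epochs": 5,
    "batch_size": 32,
    "learning_rate": 1e-5,
    "warmup_steps": 200,
}

# Output paths (names are illustrative)
MODEL_OUTPUT_DIR = "models/ranksaga-optimized-e5-v1"
RESULTS_DIR = "results"
```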
Evaluate the pre-trained model to establish baseline performance:
```bash
python run_baseline.py
```

This will:
- Download BEIR datasets if not present
- Evaluate the base model on all configured datasets
- Save results to results/baseline/
Expected time: 10-30 minutes depending on datasets and hardware
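Under the hood, run_baseline.py presumably follows the standard BEIR + Sentence-Transformers evaluation flow. A condensed sketch (not the script's exact code; the dataset and batch-size choices here are illustrative):

```python
# Condensed sketch of a BEIR dense-retrieval evaluation, shown for one dataset.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

dataset = "nfcorpus"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap the base model as a dense retriever and run exact (brute-force) search.
model = DRES(models.SentenceBERT("intfloat/e5-base-v2"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)

# NDCG, MAP, Recall, and Precision at the configured cutoffs.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, k_values=[10, 100])
print(ndcg, recall)
```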
Apply RankSaga optimization techniques to improve the model:
```bash
python fine_tune_model.py
```

This will:
- Load the base model
- Prepare training data from BEIR datasets
- Fine-tune using Multiple Negatives Ranking Loss
- Save optimized model to models/ranksaga-optimized-e5-v1/
Expected time: 30-120 minutes depending on training data size and epochs
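The core of this step can be sketched with the Sentence-Transformers training API and Multiple Negatives Ranking Loss. The training pairs below are placeholders, not the repository's actual data pipeline:

```python
# Simplified sketch of fine-tuning with Multiple Negatives Ranking Loss.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/e5-base-v2")

# In-batch negatives: each example is a (query, relevant passage) pair.
# These two examples are placeholders for pairs built from the BEIR splits.
train_examples = [
    InputExample(texts=["query: what causes anemia", "passage: Anemia is caused by ..."]),
    InputExample(texts=["query: symptoms of gout", "passage: Gout typically presents ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=200,
    optimizer_params={"lr": 1e-5},
    output_path="models/ranksaga-optimized-e5-v1",
)
```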
Evaluate the fine-tuned model:
```bash
python run_optimized.py
```

This will:
- Load the fine-tuned model
- Evaluate on the same BEIR datasets
- Save results to results/optimized/
Expected time: 10-30 minutes
Generate comparison reports and visualizations:
```bash
python compare_results.py
```

This will:
- Load baseline and optimized results
- Calculate improvement percentages
- Generate comparison table (CSV)
- Create visualization charts
- Generate markdown report
Output files:
- results/comparison_table.csv: Detailed comparison data
- results/comparison_visualization.png: Performance charts
- results/comparison_report.md: Human-readable report
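The comparison itself presumably reduces to the relative change per metric. A minimal sketch, assuming results are stored as metric-to-score mappings (the real script also produces charts and a markdown report, and its result format may differ):

```python
# Minimal sketch of the comparison step; the numbers below are illustrative.
import pandas as pd

baseline = {"NDCG@10": 0.30, "Recall@100": 0.50}
optimized = {"NDCG@10": 0.35, "Recall@100": 0.60}

rows = []
for metric, base_score in baseline.items():
    opt_score = optimized[metric]
    # Relative change over the baseline, expressed as a percentage.
    improvement = (opt_score - base_score) / base_score * 100
    rows.append({"metric": metric, "baseline": base_score,
                 "optimized": opt_score, "improvement_%": round(improvement, 2)})

df = pd.DataFrame(rows)
df.to_csv("results/comparison_table.csv", index=False)  # assumes results/ exists
print(df)
```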
Project structure:

```
benchmarking/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── config.py                 # Configuration settings
├── run_baseline.py           # Baseline benchmarking script
├── fine_tune_model.py        # Model fine-tuning script
├── run_optimized.py          # Optimized model benchmarking
├── compare_results.py        # Results comparison and visualization
├── utils/
│   ├── data_loader.py        # BEIR dataset utilities
│   └── evaluation.py         # Evaluation metrics helpers
├── models/                   # Saved fine-tuned models (gitignored)
├── results/                  # Benchmark results (gitignored)
│   ├── baseline/
│   └── optimized/
└── datasets/                 # Downloaded BEIR datasets (gitignored)
```
By default, the following datasets are used:
- scifact: Scientific fact-checking (300 queries, 5K documents)
- nfcorpus: Medical information retrieval (323 queries, 3.6K documents)
- scidocs: Scientific document retrieval (1K queries, 25K documents)
- quora: Duplicate question detection (10K queries, 523K documents)
You can modify the dataset list in config.py.
The following metrics are reported for each dataset:

- NDCG@10: Normalized Discounted Cumulative Gain at rank 10 (primary metric)
- NDCG@100: NDCG at rank 100
- MAP@100: Mean Average Precision at rank 100
- Recall@100: Recall at rank 100
- Precision@100: Precision at rank 100
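For reference, NDCG@k compares the discounted cumulative gain of the ranked list against that of an ideal ranking; one common formulation (exact gain and discount conventions vary by evaluation toolkit) is:

$$
\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{\,rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}
$$

where rel_i is the graded relevance of the document at rank i and IDCG@k is the DCG@k of the best possible ordering of the relevant documents.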
When reading the comparison results:

- Positive improvement %: The optimized model performs better
- Negative improvement %: The optimized model performs worse (may indicate overfitting)
- Average improvement: Overall performance gain across all datasets
If you run out of memory:

- Reduce batch_size in config.py
- Use fewer datasets for training
- Use a smaller base model
If training is slow:

- Use a GPU if available (PyTorch will detect it automatically)
- Reduce number of epochs
- Use a subset of training data
If dataset loading or downloading fails:

- Datasets are automatically downloaded on first use
- Check internet connection
- Verify dataset names in config.py match BEIR dataset names
If the fine-tuned model cannot be found:

- Ensure fine_tune_model.py completed successfully
- Check that the model path in config.py matches the saved model location
- Verify model files exist in the models/ directory
Edit config.py to modify:
```python
TRAINING_CONFIG = {
    "epochs": 5,            # Increase for better results (slower)
    "batch_size": 32,       # Increase if you have more GPU memory
    "learning_rate": 1e-5,  # Lower for more stable training
    "warmup_steps": 200,    # More warmup for larger models
}
```

Add dataset names to BEIR_DATASETS in config.py:

```python
BEIR_DATASETS = ["scifact", "nfcorpus", "scidocs", "quora", "msmarco", "nq"]  # "nq" is BEIR's identifier for Natural Questions
```

Change BASE_MODEL in config.py:
```python
BASE_MODEL = "sentence-transformers/all-mpnet-base-v2"  # Alternative model
```
| Dataset | NDCG@10 | NDCG@100 | MAP@100 | Recall@100 |
|---|---|---|---|---|
| NFCorpus | +15.25% | +32.62% | +49.49% | +51.03% |
| SciDocs | +3.14% | +11.82% | +7.70% | +20.21% |
| Quora | -0.81% | -0.79% | -0.90% | -0.39% |
| SciFact | -26.28% | -22.82% | -28.61% | -7.71% |
See detailed results in the results/ directory and our blog post.
Comprehensive visualizations are available:
- Performance comparison charts
- Improvement heatmaps
- Per-dataset analysis
- Domain-specific comparisons
Generate visualizations:
```bash
python generate_blog_visualizations.py
```

Additional documentation:

- Methodology: Detailed explanation of our fine-tuning approach
- Datasets: Information about BEIR datasets used
- Modal Setup: Guide for running on Modal.com cloud infrastructure
Our optimization approach uses:
- Base Model: intfloat/e5-base-v2
- Loss Function: Multiple Negatives Ranking Loss
- Training: 5 epochs, batch size 32, learning rate 1e-5
- Data: All BEIR dataset splits (train/dev/test)
- Hardware: A10G GPU on Modal.com
See docs/METHODOLOGY.md for complete details.
We're committed to open-source AI research. This repository includes:
- ✅ Complete benchmarking code
- ✅ Fine-tuning scripts and configurations
- ✅ Evaluation results and analysis
- ✅ Visualization generation tools
- ✅ Model export utilities for Hugging Face
Contributions are welcome! Please feel free to:
- Open issues for bugs or questions
- Submit pull requests for improvements
- Share results from different configurations
- Provide feedback on methodology
If you use our models or reference our work:
```bibtex
@misc{ranksaga-beir-2026,
  title={BEIR Benchmarking Results: RankSaga Embedding Model Optimization},
  author={RankSaga},
  year={2026},
  url={https://github.com/RankSaga/bier-benchmarking}
}
```

For the BEIR benchmark:
```bibtex
@inproceedings{thakur2021beir,
  title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
  author={Nandan Thakur and Nils Reimers and Andreas Rücklé and Abhishek Srivastava and Iryna Gurevych},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year={2021},
  url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}
```

Related links:

- Model on Hugging Face: RankSaga/ranksaga-optimized-e5-v2
- Blog Post: BEIR Benchmarking Results
- RankSaga Website: https://ranksaga.com
- Contact: Get in Touch
This project is licensed under the MIT License - see the LICENSE file for details.
The BEIR benchmark itself is licensed under Apache 2.0.
Thanks to:

- BEIR benchmark creators for the comprehensive evaluation framework
- E5 model authors for the excellent base model
- Sentence Transformers library for the powerful framework
- Modal.com for GPU infrastructure
RankSaga provides enterprise AI consulting services including:
- Custom embedding model optimization
- Semantic search solutions
- Vector database management
- LLM training and fine-tuning
Contact us for commercial inquiries.