RankSaga BEIR Benchmarking


This repository contains the scripts, results, and documentation for RankSaga's BEIR (Benchmarking IR) benchmarking study. We evaluated embedding-model optimization techniques and measured gains of up to 51% on medical information retrieval tasks.

Key Results

Our optimized models show substantial improvements on the domain-specific BEIR datasets:

  • NFCorpus (Medical): +15-51% improvement across all metrics
  • SciDocs (Scientific): +3-20% improvement
  • Quora (General): Performance held within 1% of the baseline
  • SciFact (Scientific fact-checking): Regressed relative to the baseline (see the Results Summary below)
  • Maximum Gain: +51% on Recall@100 for medical information retrieval

📊 View Detailed Results and Visualizations

Quick Start

Using the Optimized Model

The fine-tuned model is available on Hugging Face:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RankSaga/ranksaga-optimized-e5-v2")
embeddings = model.encode(["Your text here"])
```

Download from Hugging Face →

Running the Benchmark

```bash
# Install dependencies
pip install -r requirements.txt

# Run baseline evaluation
python run_baseline.py

# Fine-tune model
python fine_tune_model.py

# Evaluate optimized model
python run_optimized.py

# Compare results
python compare_results.py
```

📖 Overview

This benchmarking suite allows you to:

  1. Baseline Evaluation: Benchmark pre-trained models (e.g., intfloat/e5-base-v2) on BEIR datasets
  2. Model Fine-Tuning: Apply RankSaga optimization techniques to improve model performance
  3. Optimized Evaluation: Benchmark fine-tuned models to measure improvements
  4. Results Comparison: Compare baseline vs optimized performance with visualizations
  5. Reproduce Results: All code, data, and configurations are provided

Setup

Option 1: Run on Modal.com (Recommended - Fast GPU)

Best for: Fast execution, no local GPU needed, pay-per-use

See MODAL_SETUP.md for complete instructions.

Quick start:

```bash
pip install modal
modal token new
modal deploy modal_app.py
modal run modal_app.py
```
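
modal_app.py itself is not reproduced in this README; see MODAL_SETUP.md for its details. As a rough, hypothetical sketch of what a Modal entry point for a GPU benchmark job can look like (the app name, image contents, and function below are illustrative, not the actual file):

```python
import modal

# Illustrative sketch only; the real modal_app.py may be organized differently.
image = modal.Image.debian_slim().pip_install("sentence-transformers", "beir")
app = modal.App("ranksaga-beir-benchmarking", image=image)

@app.function(gpu="A10G", timeout=60 * 60)
def run_benchmark() -> None:
    # Call into the benchmarking logic (e.g. the code behind run_baseline.py) here.
    ...

@app.local_entrypoint()
def main():
    run_benchmark.remote()
```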

Option 2: Run Locally

Prerequisites

  • Python 3.8 or higher
  • CUDA-capable GPU (recommended for faster training and evaluation)

Installation

  1. Create a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

  2. Install dependencies:

```bash
pip install -r requirements.txt
```

Configuration

Edit config.py to customize:

  • Base model to fine-tune
  • BEIR datasets to benchmark
  • Training hyperparameters (epochs, batch size, learning rate)
  • Output paths
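
The exact layout of config.py is not shown here, but based on the setting names referenced later in this README (BASE_MODEL, BEIR_DATASETS, TRAINING_CONFIG), it is roughly a module of constants along these lines (the output-path names below are assumptions for illustration):

```python
# config.py (illustrative sketch; the real file may differ in detail)
BASE_MODEL = "intfloat/e5-base-v2"                            # base model to fine-tune
BEIR_DATASETS = ["scifact", "nfcorpus", "scidocs", "quora"]   # datasets to benchmark

TRAINING_CONFIG = {
    "epochs": 5,
    "batch_size": 32,
    "learning_rate": 1e-5,
    "warmup_steps": 200,
}

# Output locations (assumed names, matching the directory layout shown below)
RESULTS_DIR = "results"
MODELS_DIR = "models"
```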

Usage

1. Run Baseline Benchmarking

Evaluate the pre-trained model to establish baseline performance:

```bash
python run_baseline.py
```

This will:

  • Download BEIR datasets if not present
  • Evaluate the base model on all configured datasets
  • Save results to results/baseline/

Expected time: 10-30 minutes depending on datasets and hardware
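
run_baseline.py's internals are not reproduced here, but dense-retrieval evaluation with the beir package typically follows the pattern below (a minimal sketch for a single dataset using the standard BEIR APIs and download URL; the actual script loops over every configured dataset and writes the scores to results/baseline/):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load one BEIR dataset (here: nfcorpus, test split)
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap the base model for exact dense retrieval and evaluate it
model = DRES(models.SentenceBERT("intfloat/e5-base-v2"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```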

2. Fine-Tune Model

Apply RankSaga optimization techniques to improve the model:

```bash
python fine_tune_model.py
```

This will:

  • Load the base model
  • Prepare training data from BEIR datasets
  • Fine-tune using Multiple Negatives Ranking Loss
  • Save optimized model to models/ranksaga-optimized-e5-v1/

Expected time: 30-120 minutes depending on training data size and epochs
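
fine_tune_model.py is likewise not reproduced here, but training with Multiple Negatives Ranking Loss in sentence-transformers generally looks like the sketch below (assuming (query, positive passage) pairs have already been extracted from the BEIR training data; the hyperparameters mirror the Methodology section, and the "query:"/"passage:" prefixes follow the E5 base model's convention):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/e5-base-v2")

# One InputExample per (query, relevant passage) pair; MultipleNegativesRankingLoss
# treats the other passages in the batch as negatives automatically.
train_examples = [
    InputExample(texts=["query: what causes iron deficiency?",
                        "passage: Iron deficiency anemia develops when ..."]),
    # ... pairs built from the BEIR training splits
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=200,
    optimizer_params={"lr": 1e-5},
    output_path="models/ranksaga-optimized-e5-v1",
)
```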

3. Benchmark Optimized Model

Evaluate the fine-tuned model:

```bash
python run_optimized.py
```

This will:

  • Load the fine-tuned model
  • Evaluate on the same BEIR datasets
  • Save results to results/optimized/

Expected time: 10-30 minutes

4. Compare Results

Generate comparison reports and visualizations:

```bash
python compare_results.py
```

This will:

  • Load baseline and optimized results
  • Calculate improvement percentages
  • Generate comparison table (CSV)
  • Create visualization charts
  • Generate markdown report

Output files:

  • results/comparison_table.csv - Detailed comparison data
  • results/comparison_visualization.png - Performance charts
  • results/comparison_report.md - Human-readable report
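
The improvement figures are relative deltas over the baseline score. A minimal sketch of that comparison step, assuming each run stores its per-dataset metric scores as JSON (the scores.json filename is illustrative):

```python
import json
import pandas as pd

# e.g. {"nfcorpus": {"NDCG@10": 0.33, "Recall@100": 0.29, ...}, ...}
with open("results/baseline/scores.json") as f:
    baseline = json.load(f)
with open("results/optimized/scores.json") as f:
    optimized = json.load(f)

rows = []
for dataset, metrics in baseline.items():
    for metric, base_score in metrics.items():
        opt_score = optimized[dataset][metric]
        rows.append({
            "dataset": dataset,
            "metric": metric,
            "baseline": base_score,
            "optimized": opt_score,
            # Relative change over the baseline, in percent
            "improvement_%": 100.0 * (opt_score - base_score) / base_score,
        })

pd.DataFrame(rows).to_csv("results/comparison_table.csv", index=False)
```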

File Structure

```
benchmarking/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── config.py                 # Configuration settings
├── run_baseline.py           # Baseline benchmarking script
├── fine_tune_model.py        # Model fine-tuning script
├── run_optimized.py          # Optimized model benchmarking
├── compare_results.py        # Results comparison and visualization
├── utils/
│   ├── data_loader.py        # BEIR dataset utilities
│   └── evaluation.py         # Evaluation metrics helpers
├── models/                   # Saved fine-tuned models (gitignored)
├── results/                  # Benchmark results (gitignored)
│   ├── baseline/
│   └── optimized/
└── datasets/                 # Downloaded BEIR datasets (gitignored)
```

BEIR Datasets

By default, the following datasets are used:

  • scifact: Scientific fact-checking (300 queries, 5K documents)
  • nfcorpus: Medical information retrieval (323 queries, 3.6K documents)
  • scidocs: Scientific document retrieval (1K queries, 25K documents)
  • quora: Duplicate question detection (10K queries, 523K documents)

You can modify the dataset list in config.py.

Understanding Results

Metrics

  • NDCG@10: Normalized Discounted Cumulative Gain at rank 10 (primary metric)
  • NDCG@100: NDCG at rank 100
  • MAP@100: Mean Average Precision at rank 100
  • Recall@100: Recall at rank 100
  • Precision@100: Precision at rank 100
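
For reference, NDCG@k compares the discounted cumulative gain of the returned ranking against that of the ideal ranking:

$$\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{gain}(rel_i)}{\log_2(i+1)}$$

where $rel_i$ is the graded relevance of the document at rank $i$, the gain is either $rel_i$ or $2^{rel_i}-1$ depending on the implementation, and IDCG@k is the DCG of the ideal ordering, so a score of 1.0 corresponds to a perfect ranking.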

Interpreting Improvements

  • Positive improvement %: The optimized model performs better
  • Negative improvement %: The optimized model performs worse (may indicate overfitting)
  • Average improvement: Overall performance gain across all datasets

Troubleshooting

Out of Memory Errors

  • Reduce batch_size in config.py
  • Use fewer datasets for training
  • Use a smaller base model

Slow Training

  • Use GPU if available (PyTorch will automatically detect)
  • Reduce number of epochs
  • Use a subset of training data

Missing Datasets

  • Datasets are automatically downloaded on first use
  • Check internet connection
  • Verify dataset names in config.py match BEIR dataset names

Model Loading Errors

  • Ensure fine_tune_model.py completed successfully
  • Check that model path in config.py matches saved model location
  • Verify model files exist in models/ directory

Advanced Usage

Custom Training Configuration

Edit config.py to modify:

```python
TRAINING_CONFIG = {
    "epochs": 5,              # Increase for better results (slower)
    "batch_size": 32,         # Increase if you have more GPU memory
    "learning_rate": 1e-5,    # Lower for more stable training
    "warmup_steps": 200,      # More warmup for larger models
}
```

Adding More Datasets

Add dataset names to BEIR_DATASETS in config.py:

```python
BEIR_DATASETS = ["scifact", "nfcorpus", "scidocs", "quora", "msmarco", "natural-questions"]
```

Using Different Base Models

Change BASE_MODEL in config.py:

```python
BASE_MODEL = "sentence-transformers/all-mpnet-base-v2"  # Alternative model
```


📊 Results Summary

Performance Improvements

| Dataset   | NDCG@10 | NDCG@100 | MAP@100 | Recall@100 |
|-----------|---------|----------|---------|------------|
| NFCorpus  | +15.25% | +32.62%  | +49.49% | +51.03%    |
| SciDocs   | +3.14%  | +11.82%  | +7.70%  | +20.21%    |
| Quora     | -0.81%  | -0.79%   | -0.90%  | -0.39%     |
| SciFact   | -26.28% | -22.82%  | -28.61% | -7.71%     |

See detailed results in results/ directory and our blog post

Visualizations

Comprehensive visualizations are available:

  • Performance comparison charts
  • Improvement heatmaps
  • Per-dataset analysis
  • Domain-specific comparisons

Generate visualizations:

```bash
python generate_blog_visualizations.py
```

📚 Documentation

  • Methodology: Detailed explanation of our fine-tuning approach
  • Datasets: Information about BEIR datasets used
  • Modal Setup: Guide for running on Modal.com cloud infrastructure

🔬 Methodology

Our optimization approach uses:

  • Base Model: intfloat/e5-base-v2
  • Loss Function: Multiple Negatives Ranking Loss
  • Training: 5 epochs, batch size 32, learning rate 1e-5
  • Data: All BEIR dataset splits (train/dev/test)
  • Hardware: A10G GPU on Modal.com

See docs/METHODOLOGY.md for complete details.

🎁 Open Source Contributions

We're committed to open-source AI research. This repository includes:

  • ✅ Complete benchmarking code
  • ✅ Fine-tuning scripts and configurations
  • ✅ Evaluation results and analysis
  • ✅ Visualization generation tools
  • ✅ Model export utilities for Hugging Face

🤝 Contributing

Contributions are welcome! Please feel free to:

  • Open issues for bugs or questions
  • Submit pull requests for improvements
  • Share results from different configurations
  • Provide feedback on methodology

📝 Citation

If you use our models or reference our work:

```bibtex
@misc{ranksaga-beir-2026,
  title={BEIR Benchmarking Results: RankSaga Embedding Model Optimization},
  author={RankSaga},
  year={2026},
  url={https://github.com/RankSaga/bier-benchmarking}
}
```

For the BEIR benchmark:

```bibtex
@inproceedings{thakur2021beir,
    title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
    author={Nandan Thakur and Nils Reimers and Andreas Rücklé and Abhishek Srivastava and Iryna Gurevych},
    booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
    year={2021},
    url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}
```


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

The BEIR benchmark itself is licensed under Apache 2.0.

🙏 Acknowledgments

  • BEIR benchmark creators for the comprehensive evaluation framework
  • E5 model authors for the excellent base model
  • Sentence Transformers library for the powerful framework
  • Modal.com for GPU infrastructure

💼 Commercial Use

RankSaga provides enterprise AI consulting services including:

  • Custom embedding model optimization
  • Semantic search solutions
  • Vector database management
  • LLM training and fine-tuning

Contact us for commercial inquiries.
