<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/evaluation/BeirEvaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BEIR Out of Domain Benchmark

About [BEIR](https://github.com/beir-cellar/beir):

BEIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your retrieval methods within the benchmark.

Refer to the repo via the link for a full list of supported datasets.
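As a rough illustration of what the benchmark works with, the sketch below mimics the dictionary layout of a BEIR dataset (corpus, queries, and relevance judgments) with toy data, and scores a made-up retriever output with a simple recall@k. The document ids, texts, scores, and the `recall_at_k` helper are all invented for this example; BEIR ships its own metric implementations, so treat this as a sketch of the idea rather than the library's code.

```python
# Toy data in the dictionary layout used by BEIR datasets (contents are made up).
corpus = {
    "d1": {"title": "Vitamin D", "text": "Vitamin D and bone health."},
    "d2": {"title": "Fiber", "text": "Dietary fiber and digestion."},
    "d3": {"title": "Omega-3", "text": "Omega-3 fatty acids and the heart."},
}
queries = {
    "q1": "What supports bone health?",
    "q2": "Is fish oil good for the heart?",
}
# qrels: query id -> {doc id: graded relevance judgment}
qrels = {"q1": {"d1": 2}, "q2": {"d3": 1}}

# A retriever's output has a similar shape: query id -> {doc id: score}.
# These scores are hypothetical.
results = {
    "q1": {"d1": 0.91, "d2": 0.40, "d3": 0.12},
    "q2": {"d2": 0.55, "d3": 0.50, "d1": 0.10},
}


def recall_at_k(qrels, results, k):
    """Mean fraction of each query's relevant documents found in its top-k results."""
    per_query = []
    for qid, judged in qrels.items():
        relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
        if not relevant:
            continue  # skip queries without any relevant document
        scores = results.get(qid, {})
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits = sum(1 for doc_id in top_k if doc_id in relevant)
        per_query.append(hits / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


print(recall_at_k(qrels, results, k=1))
print(recall_at_k(qrels, results, k=2))
```

With these toy scores, only `q1` has its relevant document ranked first, so recall improves once k grows from 1 to 2.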

Here, we compare three Hugging Face embedding models: `BAAI/bge-small-en-v1.5`, `BAAI/bge-large-en`, and `sentence-transformers/all-mpnet-base-v2`. We set the retriever's `similarity_top_k` to 30, evaluate on the `nfcorpus` dataset, and report metrics at k = 3, 10, and 30.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [16]:
%pip install llama-index-embeddings-huggingface

In [17]:
!pip install llama-index

In [18]:
!pip install beir

In [19]:
from llama_index.core import VectorStoreIndex
from llama_index.core.evaluation.benchmarks import BeirEvaluator
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Build a retriever that embeds the documents with the given model
def create_retriever(documents, model_name):
    embed_model = HuggingFaceEmbedding(model_name=model_name)
    index = VectorStoreIndex.from_documents(
        documents, embed_model=embed_model, show_progress=True
    )
    return index.as_retriever(similarity_top_k=30)

# Models to evaluate
models_to_test = [
    "BAAI/bge-small-en-v1.5",
    "BAAI/bge-large-en",
    "sentence-transformers/all-mpnet-base-v2",
]

for model_name in models_to_test:
    print(f"Testing model: {model_name}")

    # Bind the current model name as a default argument so the callable
    # does not depend on the loop variable's later value
    retriever_function = lambda docs, name=model_name: create_retriever(docs, name)

    # Run the evaluation on nfcorpus, reporting metrics at k = 3, 10, and 30
    BeirEvaluator().run(
        retriever_function,
        datasets=["nfcorpus"],
        metrics_k_values=[3, 10, 30],
    )


Each run parses the 3633 nfcorpus documents, generates their embeddings, and evaluates 323 queries. Final progress bar for each model, in the order tested:

100%|██████████| 323/323 [00:37<00:00,  8.53it/s]

100%|██████████| 323/323 [01:17<00:00,  4.18it/s]

100%|██████████| 323/323 [01:00<00:00,  5.32it/s]
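When comparing several models in one loop, it can help to keep each run's printed report next to the model name. The sketch below assumes the evaluation call reports its scores by printing to standard output, which is worth checking for your installed version. `run_and_capture` and `fake_evaluation` are hypothetical names introduced here for illustration, and the stand-in function only imitates a call that prints.

```python
import io
from contextlib import redirect_stdout


def run_and_capture(fn, *args, **kwargs):
    """Call fn and return (its return value, the text it printed to stdout)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        value = fn(*args, **kwargs)
    return value, buffer.getvalue()


# Hypothetical stand-in for an evaluation call that only prints its report.
def fake_evaluation(model_name):
    print(f"Results for: {model_name}")


captured = {}
for name in ["model-a", "model-b"]:
    _, text = run_and_capture(fake_evaluation, name)
    captured[name] = text.strip()

for name, report in captured.items():
    print(name, "->", report)
```

Only standard output is redirected here, so anything written to standard error (where `tqdm` draws its progress bars by default) still shows up as usual.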


Higher is better for all the evaluation metrics.

This [towardsdatascience article](https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54) covers NDCG, MAP and MRR in greater depth.
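For a quick feel of how these metrics behave, here is a minimal sketch that scores a single ranked list. It makes two simplifying assumptions that a full evaluation would not: the ideal ordering for NDCG is built only from the documents in the list, and average precision is normalized by the relevant documents that appear in the list. The function names are chosen for this example and are not taken from any library.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def average_precision(relevances):
    """Mean of precision@rank taken at each rank holding a relevant result."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Relevance of each retrieved document, in ranked order (1 = relevant).
ranked = [0, 1, 0, 1]
print("NDCG@4:", round(ndcg_at_k(ranked, 4), 4))
print("RR:", reciprocal_rank(ranked))
print("AP:", average_precision(ranked))
```

MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries. Moving the relevant documents to the top of the list, as in `[1, 1, 0, 0]`, raises all three scores to 1.0.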