<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/evaluation/BeirEvaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BEIR Out of Domain Benchmark

About [BEIR](https://github.com/beir-cellar/beir):

BEIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your retrieval methods within the benchmark.

Refer to the repo via the link for a full list of supported datasets.

Here, we compare several Hugging Face embedding models, including the `all-MiniLM-L6-v2` sentence-transformer, which is one of the fastest for its accuracy range. We set the retriever's `similarity_top_k` to 30 and evaluate on the `nfcorpus` dataset.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [5]:
%pip install llama-index-embeddings-huggingface



In [6]:
!pip install llama-index



In [8]:
!pip install beir


In [10]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.evaluation.benchmarks import BeirEvaluator
from llama_index.core import VectorStoreIndex
import pandas as pd

# List of models to evaluate
models_to_test = [
    "BAAI/bge-small-en-v1.5",
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
    "BAAI/bge-large-en-v1.5",
]

# Datasets for evaluation
datasets = ["nfcorpus"]

# Initialize results storage
results = []

# Build a retriever over the documents using the given embedding model
def create_retriever(documents, model_name):
    embed_model = HuggingFaceEmbedding(model_name=model_name)
    index = VectorStoreIndex.from_documents(
        documents, embed_model=embed_model, show_progress=True
    )
    return index.as_retriever(similarity_top_k=30)

# Loop through models and evaluate
for model_name in models_to_test:
    print(f"Evaluating model: {model_name}")

    # Bind model_name as a default argument so the callable keeps
    # this iteration's value
    def retriever_creator(documents, model_name=model_name):
        return create_retriever(documents, model_name)

    # Run evaluation
    evaluator = BeirEvaluator()
    scores = evaluator.run(
        retriever_creator,
        datasets=datasets,
        metrics_k_values=[3, 10, 30],
    )

    # In the run shown here, run() printed the metrics and returned None,
    # so only store scores when a dict actually comes back
    if isinstance(scores, dict):
        results.append({"model": model_name, **scores})

# Compare models only if any scores were collected
if results:
    df_results = pd.DataFrame(results)
    # Sort by the primary metric (e.g., NDCG@10)
    print(df_results.sort_values(by="NDCG@10", ascending=False))
else:
    print("No scores were returned; compare the metrics printed above.")


Evaluating model: BAAI/bge-small-en-v1.5
Dataset: nfcorpus downloaded at: /tmp/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------


  0%|          | 0/3633 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/3633 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/1595 [00:00<?, ?it/s]

Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels


100%|██████████| 323/323 [00:37<00:00,  8.67it/s]


Results for: nfcorpus
{'NDCG@3': 0.38766, 'MAP@3': 0.0847, 'Recall@3': 0.09833, 'precision@3': 0.37255}
{'NDCG@10': 0.3361, 'MAP@10': 0.11749, 'Recall@10': 0.15986, 'precision@10': 0.25325}
{'NDCG@30': 0.30161, 'MAP@30': 0.13719, 'Recall@30': 0.22052, 'precision@30': 0.15294}
-------------------------------------


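Note: in this run, `evaluator.run` printed the metrics above and returned `None`, so they cannot be collected from its return value. Attempting `scores["model"] = model_name` on that result raises `TypeError: 'NoneType' object does not support item assignment`.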
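Because the metrics are printed as one dictionary per cutoff, one way to compare models is to copy those dictionaries into a single row per model. The sketch below does this with plain Python, using the values printed above for `BAAI/bge-small-en-v1.5`. The helper name `flatten_scores` is hypothetical and not part of LlamaIndex or BEIR.

```python
# Metric dictionaries copied from the printed output above (one per cutoff k)
printed = [
    {"NDCG@3": 0.38766, "MAP@3": 0.0847, "Recall@3": 0.09833, "precision@3": 0.37255},
    {"NDCG@10": 0.3361, "MAP@10": 0.11749, "Recall@10": 0.15986, "precision@10": 0.25325},
    {"NDCG@30": 0.30161, "MAP@30": 0.13719, "Recall@30": 0.22052, "precision@30": 0.15294},
]


def flatten_scores(model_name, metric_dicts):
    """Merge per-cutoff metric dicts into one row keyed by metric name (hypothetical helper)."""
    row = {"model": model_name}
    for metrics in metric_dicts:
        row.update(metrics)
    return row


# One row per model; add more rows as further models are evaluated
rows = [flatten_scores("BAAI/bge-small-en-v1.5", printed)]

# Sort by the primary metric, best first
for row in sorted(rows, key=lambda r: r["NDCG@10"], reverse=True):
    print(f'{row["model"]}: NDCG@10={row["NDCG@10"]:.4f}, Recall@30={row["Recall@30"]:.4f}')
```

The resulting list of rows can also be passed to `pd.DataFrame(rows)` if a tabular view is preferred.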

Higher is better for all the evaluation metrics.

This [towardsdatascience article](https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54) covers NDCG, MAP and MRR in greater depth.
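To make the metrics more concrete, here is a minimal sketch of Precision@k, Recall@k and NDCG@k on a toy ranking. It follows the common textbook definitions (linear gain with a log2 rank discount for NDCG). The evaluation code BEIR relies on may differ in details, and the document IDs and relevance grades below are made up for illustration.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, gains, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: graded relevance judgments (qrels) for one query
gains = {"d1": 2, "d2": 1, "d5": 1}
relevant = set(gains)

# A retriever's ranking, best match first
ranking = ["d3", "d1", "d7", "d2"]

print("precision@3:", precision_at_k(ranking, relevant, 3))
print("recall@3:", recall_at_k(ranking, relevant, 3))
print("NDCG@3:", ndcg_at_k(ranking, gains, 3))
```

A ranking that places every relevant document first, in order of relevance grade, scores an NDCG of 1.0, which is why higher values indicate better retrieval.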