#### Embeddings Model Evaluation

In this notebook, we evaluate different HuggingFace embedding models on the `nfcorpus` benchmark to decide which one is best suited for our pipeline.

In [1]:
%pip install llama-index-embeddings-huggingface

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.4.0-py3-none-any.whl.metadata (767 bytes)
Collecting llama-index-core<0.13.0,>=0.12.0 (from llama-index-embeddings-huggingface)
  Downloading llama_index_core-0.12.3-py3-none-any.whl.metadata (2.5 kB)
Collecting dataclasses-json (from llama-index-core<0.13.0,>=0.12.0->llama-index-embeddings-huggingface)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting dirtyjson<2.0.0,>=1.0.8 (from llama-index-core<0.13.0,>=0.12.0->llama-index-embeddings-huggingface)
  Downloading dirtyjson-1.0.8-py3-none-any.whl.metadata (11 kB)
Collecting filetype<2.0.0,>=1.2.0 (from llama-index-core<0.13.0,>=0.12.0->llama-index-embeddings-huggingface)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting pydantic<2.10.0,>=2.7.0 (from llama-index-core<0.13.0,>=0.12.0->llama-index-embeddings-huggingface)
  Downloading pydantic-2.9.2-py3-none-any.whl.metadata (149 kB)

In [2]:
%pip install llama-index

Collecting llama-index
  Downloading llama_index-0.12.3-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.0-py3-none-any.whl.metadata (726 bytes)
Collecting llama-index-cli<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_cli-0.4.0-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-embeddings-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.3.1-py3-none-any.whl.metadata (684 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.6.3-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading llama_index_legacy-0.9.48.post4-py3-none-any.whl.metadata (8.5 kB)
Collecting llama-index-llms-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_llms_openai-0.3.2-py3-none-any.whl.metadata (3.3 kB)


In [3]:
%pip install beir

Collecting beir
  Downloading beir-2.0.0.tar.gz (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.6/53.6 kB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting pytrec_eval (from beir)
  Downloading pytrec_eval-0.5.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Collecting faiss_cpu (from beir)
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting elasticsearch==7.9.1 (from beir)
  Downloading elasticsearch-7.9.1-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting datasets (from beir)
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->beir)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets->beir)
  Downloa

In [6]:
import io
import sys
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.base.base_retriever import BaseRetriever
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import Document
from llama_index.core.evaluation.benchmarks import BeirEvaluator

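def create_retriever(documents, model_name):
    """Embed `documents` with the given HuggingFace model and return a top-30 retriever."""
    embed_model = HuggingFaceEmbedding(model_name=model_name)
    # Build an in-memory vector index over the benchmark corpus.
    index = VectorStoreIndex.from_documents(
        documents, embed_model=embed_model, show_progress=True
    )
    # Retrieve 30 candidates so that metrics up to k=30 can be computed.
    return index.as_retriever(similarity_top_k=30)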

models_to_test = [
    "BAAI/bge-small-en-v1.5",
    "BAAI/bge-large-en",
    "sentence-transformers/all-mpnet-base-v2",
    "intfloat/e5-large",
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/gtr-t5-large",
    "BAAI/bge-large-zh-v1.5"
]

results = []

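for model_name in models_to_test:
    print(f"Testing model: {model_name}")

    # Bind the current model name as a default argument so the callable
    # does not depend on the loop variable after this iteration.
    retriever_function = lambda docs, name=model_name: create_retriever(docs, name)

    # The evaluator builds the retriever from the corpus and prints the metrics.
    BeirEvaluator().run(
        retriever_function,
        datasets=["nfcorpus"],
        metrics_k_values=[3, 10, 30]
    )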


Testing model: BAAI/bge-small-en-v1.5
Dataset: nfcorpus downloaded at: /tmp/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------


  0%|          | 0/3633 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/3633 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/1595 [00:00<?, ?it/s]

Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels


100%|██████████| 323/323 [00:38<00:00,  8.38it/s]

Results for: nfcorpus
{'NDCG@3': 0.38766, 'MAP@3': 0.0847, 'Recall@3': 0.09833, 'precision@3': 0.37255}
{'NDCG@10': 0.3361, 'MAP@10': 0.11749, 'Recall@10': 0.15986, 'precision@10': 0.25325}
{'NDCG@30': 0.30161, 'MAP@30': 0.13719, 'Recall@30': 0.22052, 'precision@30': 0.15294}
-------------------------------------
Testing model: BAAI/bge-large-en
Dataset: nfcorpus downloaded at: /tmp/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------






Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels


100%|██████████| 323/323 [01:17<00:00,  4.16it/s]

Results for: nfcorpus
{'NDCG@3': 0.39825, 'MAP@3': 0.09866, 'Recall@3': 0.10807, 'precision@3': 0.37564}
{'NDCG@10': 0.34273, 'MAP@10': 0.13283, 'Recall@10': 0.16902, 'precision@10': 0.25077}
{'NDCG@30': 0.31692, 'MAP@30': 0.15491, 'Recall@30': 0.23701, 'precision@30': 0.15542}
-------------------------------------
Testing model: sentence-transformers/all-mpnet-base-v2
Dataset: nfcorpus downloaded at: /tmp/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------






Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels


100%|██████████| 323/323 [01:00<00:00,  5.30it/s]


Results for: nfcorpus
{'NDCG@3': 0.38528, 'MAP@3': 0.08685, 'Recall@3': 0.09944, 'precision@3': 0.37049}
{'NDCG@10': 0.33366, 'MAP@10': 0.12104, 'Recall@10': 0.16199, 'precision@10': 0.25201}
{'NDCG@30': 0.30722, 'MAP@30': 0.1425, 'Recall@30': 0.23177, 'precision@30': 0.15624}
-------------------------------------
Testing model: intfloat/e5-large
Dataset: nfcorpus downloaded at: /tmp/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------



Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels


100%|██████████| 323/323 [01:18<00:00,  4.12it/s]

Results for: nfcorpus
{'NDCG@3': 0.42656, 'MAP@3': 0.10532, 'Recall@3': 0.11589, 'precision@3': 0.40248}
{'NDCG@10': 0.37556, 'MAP@10': 0.14489, 'Recall@10': 0.18625, 'precision@10': 0.27709}
{'NDCG@30': 0.3455, 'MAP@30': 0.16929, 'Recall@30': 0.25686, 'precision@30': 0.17069}
-------------------------------------
Testing model: sentence-transformers/all-MiniLM-L6-v2
Dataset: nfcorpus downloaded at: /tmp/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------






Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels


100%|██████████| 323/323 [00:37<00:00,  8.58it/s]

Results for: nfcorpus
{'NDCG@3': 0.35476, 'MAP@3': 0.07489, 'Recall@3': 0.08583, 'precision@3': 0.33746}
{'NDCG@10': 0.31425, 'MAP@10': 0.11007, 'Recall@10': 0.15886, 'precision@10': 0.24025}
{'NDCG@30': 0.28666, 'MAP@30': 0.12801, 'Recall@30': 0.21727, 'precision@30': 0.14737}
-------------------------------------
Testing model: sentence-transformers/gtr-t5-large
Dataset: nfcorpus downloaded at: /tmp/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------






Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels


100%|██████████| 323/323 [01:05<00:00,  4.97it/s]

Results for: nfcorpus
{'NDCG@3': 0.39043, 'MAP@3': 0.09262, 'Recall@3': 0.10317, 'precision@3': 0.36739}
{'NDCG@10': 0.32691, 'MAP@10': 0.12083, 'Recall@10': 0.15619, 'precision@10': 0.23406}
{'NDCG@30': 0.29553, 'MAP@30': 0.13804, 'Recall@30': 0.21242, 'precision@30': 0.1418}
-------------------------------------
Testing model: BAAI/bge-large-zh-v1.5
Dataset: nfcorpus downloaded at: /tmp/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------






Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels


100%|██████████| 323/323 [01:18<00:00,  4.14it/s]

Results for: nfcorpus
{'NDCG@3': 0.26704, 'MAP@3': 0.05119, 'Recall@3': 0.06084, 'precision@3': 0.24665}
{'NDCG@10': 0.22158, 'MAP@10': 0.06966, 'Recall@10': 0.10174, 'precision@10': 0.16223}
{'NDCG@30': 0.19627, 'MAP@30': 0.07918, 'Recall@30': 0.14605, 'precision@30': 0.09649}
-------------------------------------
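
The evaluator prints its metrics to the cell output instead of storing them in a variable. As a minimal sketch, assuming that output has been saved as plain text, the hypothetical helper `parse_beir_output` below (not part of llama-index or BEIR) turns the log into a dictionary keyed by model name. It relies only on the two line shapes visible above: `Testing model: ...` and the dict-literal metric lines.

```python
import ast


def parse_beir_output(text):
    """Parse the printed evaluation log into {model_name: {metric: value}}.

    Hypothetical helper: it only assumes the line formats shown in the
    output above.
    """
    results = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Testing model:"):
            current = line.split(":", 1)[1].strip()
            results[current] = {}
        elif line.startswith("{") and current is not None:
            # Metric lines are Python dict literals, so literal_eval is safe here.
            results[current].update(ast.literal_eval(line))
    return results


# Two blocks copied from the output above, shortened to the relevant lines.
sample_log = """
Testing model: BAAI/bge-small-en-v1.5
Results for: nfcorpus
{'NDCG@3': 0.38766, 'MAP@3': 0.0847, 'Recall@3': 0.09833, 'precision@3': 0.37255}
{'NDCG@10': 0.3361, 'MAP@10': 0.11749, 'Recall@10': 0.15986, 'precision@10': 0.25325}
-------------------------------------
Testing model: intfloat/e5-large
Results for: nfcorpus
{'NDCG@3': 0.42656, 'MAP@3': 0.10532, 'Recall@3': 0.11589, 'precision@3': 0.40248}
"""

parsed = parse_beir_output(sample_log)
print(parsed["intfloat/e5-large"]["NDCG@3"])
```
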





Higher is better for all the evaluation metrics.

This [towardsdatascience article](https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54) covers NDCG, MAP and MRR in greater depth.
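
To make the metrics concrete, here is a simplified sketch of Precision@k, Recall@k and NDCG@k for a single query. It assumes binary relevance (a document is either relevant or not); the benchmark itself uses its own evaluation tooling and relevance judgments, so this illustrates the definitions rather than reproducing the evaluator. The document IDs are made up.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: hits near the top of the ranking count more."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: three relevant documents, two of them retrieved in the top 3.
ranked = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d5"}

print("precision@3:", precision_at_k(ranked, relevant, 3))
print("recall@3:   ", recall_at_k(ranked, relevant, 3))
print("ndcg@3:     ", ndcg_at_k(ranked, relevant, 3))
```

Precision@k divides by k while Recall@k divides by the number of relevant documents, so precision tends to fall and recall to rise as k grows. The results table shows the same pattern.
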

| **Model**                        | **NDCG@3** | **MAP@3** | **Recall@3** | **Precision@3** | **NDCG@10** | **MAP@10** | **Recall@10** | **Precision@10** | **NDCG@30** | **MAP@30** | **Recall@30** | **Precision@30** |
|-----------------------------------|------------|-----------|--------------|-----------------|-------------|------------|---------------|------------------|-------------|------------|---------------|-------------------|
| BAAI/bge-small-en-v1.5            | 0.38766    | 0.0847    | 0.09833      | 0.37255         | 0.3361      | 0.11749    | 0.15986       | 0.25325          | 0.30161     | 0.13719    | 0.22052       | 0.15294           |
| BAAI/bge-large-en                 | 0.39825    | 0.09866   | 0.10807      | 0.37564         | 0.34273     | 0.13283    | 0.16902       | 0.25077          | 0.31692     | 0.15491    | 0.23701       | 0.15542           |
| sentence-transformers/all-mpnet-base-v2 | 0.38528 | 0.08685   | 0.09944      | 0.37049         | 0.33366     | 0.12104    | 0.16199       | 0.25201          | 0.30722     | 0.1425     | 0.23177       | 0.15624           |
| intfloat/e5-large                 | **0.42656**| **0.10532**| **0.11589**  | **0.40248**     | **0.37556** | **0.14489**| **0.18625**   | **0.27709**      | **0.3455**  | **0.16929**| **0.25686**   | **0.17069**       |
| sentence-transformers/all-MiniLM-L6-v2 | 0.35476 | 0.07489   | 0.08583      | 0.33746         | 0.31425     | 0.11007    | 0.15886       | 0.24025          | 0.28666     | 0.12801    | 0.21727       | 0.14737           |
| sentence-transformers/gtr-t5-large| 0.39043    | 0.09262   | 0.10317      | 0.36739         | 0.32691     | 0.12083    | 0.15619       | 0.23406          | 0.29553     | 0.13804    | 0.21242       | 0.1418            |
| BAAI/bge-large-zh-v1.5            | 0.26704    | 0.05119   | 0.06084      | 0.24665         | 0.22158     | 0.06966    | 0.10174       | 0.16223          | 0.19627     | 0.07918    | 0.14605       | 0.09649           |

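After evaluation, `intfloat/e5-large` appears to be the best choice: it scores highest on every metric at every cutoff (k = 3, 10 and 30). `BAAI/bge-large-zh-v1.5` ranks last on every metric, which is unsurprising for a Chinese-language model evaluated on an English corpus.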
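
The ranking can also be checked programmatically. The sketch below copies the NDCG@10 column from the table into a dictionary and sorts it. NDCG@10 is an arbitrary choice for illustration; any other column could be substituted.

```python
# NDCG@10 values copied from the results table above.
ndcg_at_10 = {
    "BAAI/bge-small-en-v1.5": 0.3361,
    "BAAI/bge-large-en": 0.34273,
    "sentence-transformers/all-mpnet-base-v2": 0.33366,
    "intfloat/e5-large": 0.37556,
    "sentence-transformers/all-MiniLM-L6-v2": 0.31425,
    "sentence-transformers/gtr-t5-large": 0.32691,
    "BAAI/bge-large-zh-v1.5": 0.22158,
}

# Sort from best to worst; higher is better for NDCG.
ranking = sorted(ndcg_at_10.items(), key=lambda item: item[1], reverse=True)

for model_name, score in ranking:
    print(f"{score:.5f}  {model_name}")

best_model = ranking[0][0]
```
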