# **BEIR: A Heterogenous benchmark for Zero-shot Evaluation of Information Retrieval models** 

This notebook contains an simple and easy examples to evaluate retrieval models from our new benchmark.

## Introduction
The BEIR benchmark contains 9 diverse retrieval tasks including 17 diverse datasets. We evaluate 9 state-of-the-art retriever models all in a zero-shot evaluation setup. Today, in this colab notebook, we first will show how to download and load the 14 open-sourced datasets with just three lines of code. Afterward, we would load some state-of-the-art dense retrievers (bi-encoders) such as SBERT, ANCE, DPR models and use them for retrieval and evaluate them in a zero-shot setup.

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

Developed by Nandan Thakur, Researcher @ UKP Lab, TU Darmstadt

(https://nthakur.xyz) (nandant@gmail.com)

In [1]:
!nvidia-smi

Thu Dec 23 01:40:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# Install the beir PyPI package
!pip install beir

Collecting beir
  Downloading beir-0.2.3.tar.gz (52 kB)
[K     |████████████████████████████████| 52 kB 1.3 MB/s 
[?25hCollecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
[K     |████████████████████████████████| 78 kB 4.2 MB/s 
[?25hCollecting pytrec_eval
  Downloading pytrec_eval-0.5.tar.gz (15 kB)
Collecting faiss_cpu
  Downloading faiss_cpu-1.7.1.post3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.5 MB)
[K     |████████████████████████████████| 8.5 MB 16.3 MB/s 
[?25hCollecting elasticsearch
  Downloading elasticsearch-7.16.1-py2.py3-none-any.whl (385 kB)
[K     |████████████████████████████████| 385 kB 34.8 MB/s 
Collecting tensorflow-text
  Downloading tensorflow_text-2.7.3-cp37-cp37m-manylinux2010_x86_64.whl (4.9 MB)
[K     |████████████████████████████████| 4.9 MB 39.8 MB/s 
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
[K     |████████████████████████████████| 3.4 

In [3]:
from beir import util, LoggingHandler

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

  from tqdm.autonotebook import tqdm


# **BEIR Datasets**

BEIR contains 17 diverse datasets overall. You can view all the datasets (14 downloadable) with the link below:

[``https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/``](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/)

Please refer GitHub page to evaluate on other datasets (3 of them).


We include the following datasets in BEIR:

| Dataset   | Website| BEIR-Name | Domain     | Relevancy| Queries  | Documents | Avg. Docs/Q | Download | 
| -------- | -----| ---------| ----------- | ---------| ---------| --------- | ------| ------------| 
| MSMARCO    | [``Homepage``](https://microsoft.github.io/msmarco/)| ``msmarco`` | Misc.       |  Binary  |  6,980   |  8.84M     |    1.1 | Yes |  
| TREC-COVID |  [``Homepage``](https://ir.nist.gov/covidSubmit/index.html)| ``trec-covid``| Bio-Medical |  3-level|50|  171K| 493.5 | Yes | 
| NFCorpus   | [``Homepage``](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | ``nfcorpus``  | Bio-Medical |  3-level |  323     |  3.6K     |  38.2 | Yes |
| BioASQ     | [``Homepage``](http://bioasq.org) | ``bioasq``| Bio-Medical |  Binary  |   500    |  14.91M    |  8.05 | No | 
| NQ         | [``Homepage``](https://ai.google.com/research/NaturalQuestions) | ``nq``| Wikipedia   |  Binary  |  3,452   |  2.68M  |  1.2 | Yes | 
| HotpotQA   | [``Homepage``](https://hotpotqa.github.io) | ``hotpotqa``| Wikipedia   |  Binary  |  7,405   |  5.23M  |  2.0 | Yes |
| FiQA-2018  | [``Homepage``](https://sites.google.com/view/fiqa/) | ``fiqa``    | Finance     |  Binary  |  648     |  57K    |  2.6 | Yes | 
| Signal-1M (RT) | [``Homepage``](https://research.signal-ai.com/datasets/signal1m-tweetir.html)| ``signal1m`` | Twitter     |  3-level  |   97   |  2.86M  |  19.6 | No |
| TREC-NEWS  | [``Homepage``](https://trec.nist.gov/data/news2019.html) | ``trec-news``    | News     |  5-level  |   57    |  595K    |  19.6 | No |
| ArguAna    | [``Homepage``](http://argumentation.bplaced.net/arguana/data) | ``arguana`` | Misc.       |  Binary  |  1,406     |  8.67K    |  1.0 | Yes |
| Touche-2020| [``Homepage``](https://webis.de/events/touche-20/shared-task-1.html) | ``webis-touche2020``| Misc.       |  6-level  |  49     |  382K    |  49.2 |  Yes |
| CQADupstack| [``Homepage``](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | ``cqadupstack``| StackEx.      |  Binary  |  13,145 |  457K  |  1.4 |  Yes |
| Quora| [``Homepage``](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | ``quora``| Quora  | Binary  |  10,000     |  523K    |  1.6 |  Yes | 
| DBPedia | [``Homepage``](https://github.com/iai-group/DBpedia-Entity/) | ``dbpedia-entity``| Wikipedia |  3-level  |  400    |  4.63M    |  38.2 |  Yes | 
| SCIDOCS| [``Homepage``](https://allenai.org/data/scidocs) | ``scidocs``| Scientific |  Binary  |  1,000     |  25K    |  4.9 |  Yes | 
| FEVER| [``Homepage``](http://fever.ai) | ``fever``| Wikipedia     |  Binary  |  6,666     |  5.42M    |  1.2|  Yes | 
| Climate-FEVER| [``Homepage``](http://climatefever.ai) | ``climate-fever``| Wikipedia |  Binary  |  1,535     |  5.42M |  3.0 |  Yes |
| SciFact| [``Homepage``](https://github.com/allenai/scifact) | ``scifact``| Scientific |  Binary  |  300     |  5K    |  1.1 |  Yes |


For Simplicity, we will show example with the one of the smallest datasets - ``SciFact`` for our example. 

You can evaluate any dataset you wish by looking at the table above.

In [4]:
import pathlib, os
from beir import util

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join(os.getcwd(), "datasets")
data_path = util.download_and_unzip(url, out_dir)
print("Dataset downloaded here: {}".format(data_path))

2021-12-23 01:41:32 - Downloading scifact.zip ...


/content/datasets/scifact.zip:   0%|          | 0.00/2.69M [00:00<?, ?iB/s]

2021-12-23 01:41:35 - Unzipping scifact.zip ...
Dataset downloaded here: /content/datasets/scifact


# **Folder Structure of any BEIR dataset**

* scifact/
    * corpus.jsonl 
    * queries.jsonl 
    * qrels/
        * train.tsv
        * dev.tsv
        * test.tsv

In [5]:
!ls datasets/scifact/

corpus.jsonl  qrels  queries.jsonl


# **Data Loading**

In [6]:
from beir.datasets.data_loader import GenericDataLoader

data_path = "datasets/scifact"
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"

2021-12-23 01:41:41 - Loading Corpus...


  0%|          | 0/5183 [00:00<?, ?it/s]

2021-12-23 01:41:41 - Loaded 5183 TEST Documents.
2021-12-23 01:41:41 - Doc Example: {'text': 'Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, the mean apparent diffusion coefficients at both times were similar (1.2 vers

# **Dense Retrieval using Exact Search**

## **Sentence-BERT**
We use the [``distilbert-base-msmarco-v3``](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) SBERT model in this example.

In [7]:
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval import models
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

#### Dense Retrieval using SBERT (Sentence-BERT) ####
#### Provide any pretrained sentence-transformers model
#### The model was fine-tuned using cosine-similarity.
#### Complete list - https://www.sbert.net/docs/pretrained_models.html

model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

#### Retrieve dense results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)

2021-12-23 01:41:43 - Loading faiss with AVX2 support.
2021-12-23 01:41:43 - Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2021-12-23 01:41:43 - Loading faiss.
2021-12-23 01:41:43 - Successfully loaded faiss.
2021-12-23 01:41:47 - Load pretrained SentenceTransformer: msmarco-distilbert-base-v3


Downloading:   0%|          | 0.00/690 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.71k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/545 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/229 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/265M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/499 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

2021-12-23 01:42:01 - Use pytorch device: cuda
2021-12-23 01:42:01 - Encoding Queries...


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

2021-12-23 01:42:13 - Sorting Corpus by document length (Longest first)...
2021-12-23 01:42:13 - Scoring Function: Cosine Similarity (cos_sim)
2021-12-23 01:42:13 - Encoding Batch 1/1...


Batches:   0%|          | 0/41 [00:00<?, ?it/s]

In [8]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...

logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

2021-12-23 01:44:59 - Retriever evaluation for k in: [1, 3, 5, 10, 100, 1000]
2021-12-23 01:45:00 - 

2021-12-23 01:45:00 - NDCG@1: 0.4233
2021-12-23 01:45:00 - NDCG@3: 0.4842
2021-12-23 01:45:00 - NDCG@5: 0.5104
2021-12-23 01:45:00 - NDCG@10: 0.5379
2021-12-23 01:45:00 - NDCG@100: 0.5759
2021-12-23 01:45:00 - NDCG@1000: 0.5913
2021-12-23 01:45:00 - 

2021-12-23 01:45:00 - MAP@1: 0.3994
2021-12-23 01:45:00 - MAP@3: 0.4593
2021-12-23 01:45:00 - MAP@5: 0.4768
2021-12-23 01:45:00 - MAP@10: 0.4889
2021-12-23 01:45:00 - MAP@100: 0.4974
2021-12-23 01:45:00 - MAP@1000: 0.4980
2021-12-23 01:45:00 - 

2021-12-23 01:45:00 - Recall@1: 0.3994
2021-12-23 01:45:00 - Recall@3: 0.5256
2021-12-23 01:45:00 - Recall@5: 0.5887
2021-12-23 01:45:00 - Recall@10: 0.6723
2021-12-23 01:45:00 - Recall@100: 0.8460
2021-12-23 01:45:00 - Recall@1000: 0.9683
2021-12-23 01:45:00 - 

2021-12-23 01:45:00 - P@1: 0.4233
2021-12-23 01:45:00 - P@3: 0.1933
2021-12-23 01:45:00 - P@5: 0.1333
2021-12-23 01:45:00 - P@10: 0.0757

In [9]:
import random

#### Print top-k documents retrieved ####
top_k = 10

query_id, ranking_scores = random.choice(list(results.items()))
scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)
logging.info("Query : %s\n" % queries[query_id])

for rank in range(top_k):
    doc_id = scores_sorted[rank][0]
    # Format: Rank x: ID [Title] Body
    logging.info("Rank %d: %s [%s] - %s\n" % (rank+1, doc_id, corpus[doc_id].get("title"), corpus[doc_id].get("text")))

2021-12-23 01:45:00 - Query : Sildenafil improves erectile function in men who experience sexual dysfunction as a result of the use of SSRI antidepressants.

2021-12-23 01:45:00 - Rank 1: 39281140 [Treatment of antidepressant-associated sexual dysfunction with sildenafil: a randomized controlled trial.] - CONTEXT Sexual dysfunction is a common adverse effect of antidepressants that frequently results in treatment noncompliance. OBJECTIVE To assess the efficacy of sildenafil citrate in men with sexual dysfunction associated with the use of selective and nonselective serotonin reuptake inhibitor (SRI) antidepressants. DESIGN, SETTING, AND PATIENTS Prospective, parallel-group, randomized, double-blind, placebo-controlled trial conducted between November 1, 2000, and January 1, 2001, at 3 US university medical centers among 90 male outpatients (mean [SD] age, 45 [8] years) with major depression in remission and sexual dysfunction associated with SRI antidepressant treatment. INTERVENTION P

## **ANCE**

We use the [``msmarco-roberta-base-ance-fristp``](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) ANCE model which was fine-tuned on MSMARCO dataset for 600K steps.

In [10]:
#### Dense Retrieval using ANCE #### 
# https://www.sbert.net/docs/pretrained-models/msmarco-v3.html
# MSMARCO Dev Passage Retrieval ANCE(FirstP) 600K model from ANCE.
# The ANCE model was fine-tuned using dot-product (dot) function.

model = DRES(models.SentenceBERT("msmarco-roberta-base-ance-fristp"))
retriever = EvaluateRetrieval(model, score_function="dot")

#### Retrieve dense results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)

2021-12-23 01:47:15 - Load pretrained SentenceTransformer: msmarco-roberta-base-ance-fristp


HTTPError: ignored

In [11]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...

logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

2021-12-23 01:47:16 - Retriever evaluation for k in: [1, 3, 5, 10, 100, 1000]
2021-12-23 01:47:16 - 

2021-12-23 01:47:16 - NDCG@1: 0.4233
2021-12-23 01:47:16 - NDCG@3: 0.4842
2021-12-23 01:47:16 - NDCG@5: 0.5104
2021-12-23 01:47:16 - NDCG@10: 0.5379
2021-12-23 01:47:16 - NDCG@100: 0.5759
2021-12-23 01:47:16 - NDCG@1000: 0.5913
2021-12-23 01:47:16 - 

2021-12-23 01:47:16 - MAP@1: 0.3994
2021-12-23 01:47:16 - MAP@3: 0.4593
2021-12-23 01:47:16 - MAP@5: 0.4768
2021-12-23 01:47:16 - MAP@10: 0.4889
2021-12-23 01:47:16 - MAP@100: 0.4974
2021-12-23 01:47:16 - MAP@1000: 0.4980
2021-12-23 01:47:16 - 

2021-12-23 01:47:16 - Recall@1: 0.3994
2021-12-23 01:47:16 - Recall@3: 0.5256
2021-12-23 01:47:16 - Recall@5: 0.5887
2021-12-23 01:47:16 - Recall@10: 0.6723
2021-12-23 01:47:16 - Recall@100: 0.8460
2021-12-23 01:47:16 - Recall@1000: 0.9683
2021-12-23 01:47:16 - 

2021-12-23 01:47:16 - P@1: 0.4233
2021-12-23 01:47:16 - P@3: 0.1933
2021-12-23 01:47:16 - P@5: 0.1333
2021-12-23 01:47:16 - P@10: 0.0757

# **Lexical Retrieval using BM25 (Elasticsearch)**

## 1. Download and setup the Elasticsearch instance
Reference: https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/elasticsearch.ipynb

For demo purposes, the open-source version of the elasticsearch package is used.

In [12]:
%%bash

wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-7.9.2/
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512 

elasticsearch-oss-7.9.2-linux-x86_64.tar.gz: OK


Run the instance as a daemon process


In [13]:
%%bash --bg

sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

Starting job # 0 in a separate thread.


In [14]:
import time

# Sleep for few seconds to let the instance start.
time.sleep(20)

Once the instance has been started, grep for ``elasticsearch`` in the processes list to confirm the availability.

In [15]:
%%bash

ps -ef | grep elasticsearch

root         442     440  0 01:47 ?        00:00:00 sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
daemon       443     442 99 01:47 ?        00:00:21 /content/elasticsearch-7.9.2/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-13316508362059330965 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:fileco

In [16]:
%%bash

curl -sX GET "localhost:9200/"

{
  "name" : "0bb4cffb3c4d",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "7bL_RwFqS9Kz9wIl0lYh-A",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


In [17]:
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

#### Provide parameters for elastic-search
hostname = "localhost" 
index_name = "scifact" 
initialize = True # True, will delete existing index with same name and reindex all documents

model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
retriever = EvaluateRetrieval(model)

#### Retrieve dense results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)

2021-12-23 01:47:49 - Activating Elasticsearch....
2021-12-23 01:47:49 - Elastic Search Credentials: {'hostname': 'localhost', 'index_name': 'scifact', 'keys': {'title': 'title', 'body': 'txt'}, 'timeout': 100, 'retry_on_timeout': True, 'maxsize': 24, 'number_of_shards': 'default', 'language': 'english'}
2021-12-23 01:47:49 - Deleting previous Elasticsearch-Index named - scifact
2021-12-23 01:47:49 - Unable to create Index in Elastic Search. Reason: The client noticed that the server is not a supported distribution of Elasticsearch
2021-12-23 01:47:49 - Creating fresh Elasticsearch-Index named - scifact
2021-12-23 01:47:49 - Unable to create Index in Elastic Search. Reason: The client noticed that the server is not a supported distribution of Elasticsearch


  0%|          | 0/5183 [00:00<?, ?docs/s]

UnsupportedProductError: ignored

In [None]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

# **Reranking BM25 using Cross-Encoder**

In this example, we rerank the top-20 documents retrieved from BM25, using ([cross-encoder/ms-marco-electra-base](https://www.sbert.net/docs/pretrained-models/ce-msmarco.html)) SBERT cross-encoder model 

In [18]:
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

#### Reranking using Cross-Encoder models (list: )
cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-electra-base')
reranker = Rerank(cross_encoder_model, batch_size=128)

# Rerank top-100 results using the reranker provided
rerank_results = reranker.rerank(corpus, queries, results, top_k=20)

Downloading:   0%|          | 0.00/730 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/418M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/316 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

  0%|          | 0/5183 [01:22<?, ?docs/s]

2021-12-23 01:49:11 - Use pytorch device: cuda
2021-12-23 01:49:11 - Starting To Rerank Top-20....


Batches:   0%|          | 0/47 [00:00<?, ?it/s]

In [21]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, rerank_results, retriever.k_values)
mrr = retriever.evaluate_custom(qrels, rerank_results, retriever.k_values, metric="mrr")
top_k_accuracy = retriever.evaluate_custom(qrels, rerank_results, retriever.k_values, metric="top_k_accuracy")

  0%|          | 0/5183 [10:29<?, ?docs/s]

2021-12-23 01:58:19 - 

2021-12-23 01:58:19 - NDCG@1: 0.5500
2021-12-23 01:58:19 - NDCG@3: 0.5977
2021-12-23 01:58:19 - NDCG@5: 0.6144
2021-12-23 01:58:19 - NDCG@10: 0.6349
2021-12-23 01:58:19 - NDCG@100: 0.6369
2021-12-23 01:58:19 - NDCG@1000: 0.6369
2021-12-23 01:58:19 - 

2021-12-23 01:58:19 - MAP@1: 0.5208
2021-12-23 01:58:19 - MAP@3: 0.5753
2021-12-23 01:58:19 - MAP@5: 0.5870
2021-12-23 01:58:19 - MAP@10: 0.5965


  0%|          | 0/5183 [10:29<?, ?docs/s]

2021-12-23 01:58:19 - MAP@100: 0.5972
2021-12-23 01:58:19 - MAP@1000: 0.5972
2021-12-23 01:58:19 - 

2021-12-23 01:58:19 - Recall@1: 0.5208
2021-12-23 01:58:19 - Recall@3: 0.6339
2021-12-23 01:58:19 - Recall@5: 0.6762
2021-12-23 01:58:19 - Recall@10: 0.7368
2021-12-23 01:58:19 - Recall@100: 0.7441
2021-12-23 01:58:19 - Recall@1000: 0.7441
2021-12-23 01:58:19 - 

2021-12-23 01:58:19 - P@1: 0.5500
2021-12-23 01:58:19 - P@3: 0.2300
2021-12-23 01:58:19 - P@5: 0.1507
2021-12-23 01:58:19 - P@10: 0.0827
2021-12-23 01:58:19 - P@100: 0.0084
2021-12-23 01:58:19 - P@1000: 0.0008
2021-12-23 01:58:19 - 

2021-12-23 01:58:19 - MRR@1: 0.5500


  0%|          | 0/5183 [10:30<?, ?docs/s]

2021-12-23 01:58:19 - MRR@3: 0.5961
2021-12-23 01:58:19 - MRR@5: 0.6051
2021-12-23 01:58:19 - MRR@10: 0.6128
2021-12-23 01:58:19 - MRR@100: 0.6132
2021-12-23 01:58:19 - MRR@1000: 0.6132
2021-12-23 01:58:19 - 

2021-12-23 01:58:19 - Accuracy@1: 0.5500
2021-12-23 01:58:19 - Accuracy@3: 0.6567
2021-12-23 01:58:19 - Accuracy@5: 0.6967
2021-12-23 01:58:19 - Accuracy@10: 0.7533
2021-12-23 01:58:19 - Accuracy@100: 0.7600
2021-12-23 01:58:19 - Accuracy@1000: 0.7600
