<a href="https://colab.research.google.com/github/ChenKua/xir/blob/main/beir_haystack_document_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

In [None]:
!pip install beir
!pip install tensorflow-text
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]

In [None]:
from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

import logging
import pathlib, os

from typing import List
import requests
import pandas as pd
from haystack import Document
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import RAGenerator, DensePassageRetriever
from haystack.utils import fetch_archive_from_http

# BEIR dataset

* scifact/
    * corpus.jsonl 
    * queries.jsonl 
    * qrels/
        * train.tsv
        * dev.tsv
        * test.tsv
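
For reference, the files listed above can also be read without BEIR. The sketch below assumes the usual BEIR layout: each line of `corpus.jsonl` is a JSON object with `_id`, `title`, and `text`; each line of `queries.jsonl` has `_id` and `text`; each qrels file is tab-separated with a `query-id`, `corpus-id`, `score` header row. The helpers `load_beir_folder` and `write_sample_folder` and the toy data are hypothetical and only illustrate the structure. `GenericDataLoader` remains the loader used in this notebook.

```python
import csv
import json
import os
import tempfile


def load_beir_folder(folder, split="test"):
    """Hypothetical helper: read corpus, queries and qrels from a BEIR-style folder."""
    corpus, queries, qrels = {}, {}, {}

    with open(os.path.join(folder, "corpus.jsonl"), encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            corpus[doc["_id"]] = {"title": doc.get("title", ""), "text": doc["text"]}

    with open(os.path.join(folder, "queries.jsonl"), encoding="utf-8") as f:
        for line in f:
            query = json.loads(line)
            queries[query["_id"]] = query["text"]

    qrels_path = os.path.join(folder, "qrels", split + ".tsv")
    with open(qrels_path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for query_id, doc_id, score in reader:
            qrels.setdefault(query_id, {})[doc_id] = int(score)

    # Keep only queries that have relevance judgements in this split.
    queries = {qid: text for qid, text in queries.items() if qid in qrels}
    return corpus, queries, qrels


def write_sample_folder(folder):
    """Write a two-document toy dataset (made-up content) in the layout above."""
    os.makedirs(os.path.join(folder, "qrels"))
    with open(os.path.join(folder, "corpus.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"_id": "d1", "title": "Title one", "text": "Body one"}) + "\n")
        f.write(json.dumps({"_id": "d2", "title": "Title two", "text": "Body two"}) + "\n")
    with open(os.path.join(folder, "queries.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"_id": "q1", "text": "A sample claim"}) + "\n")
        f.write(json.dumps({"_id": "q2", "text": "A claim without judgements"}) + "\n")
    with open(os.path.join(folder, "qrels", "test.tsv"), "w", encoding="utf-8") as f:
        f.write("query-id\tcorpus-id\tscore\n")
        f.write("q1\td2\t1\n")


with tempfile.TemporaryDirectory() as tmp:
    sample_dir = os.path.join(tmp, "toy")
    write_sample_folder(sample_dir)
    toy_corpus, toy_queries, toy_qrels = load_beir_folder(sample_dir)

print(toy_queries)
print(toy_qrels)
```

The three returned dictionaries have the same nesting as the `corpus`, `queries`, and `qrels` objects used in the rest of this notebook.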

In [5]:
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()],
                    force=True)

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join(os.getcwd(), "datasets")
data_path = util.download_and_unzip(url, out_dir)
print("Dataset downloaded here: {}".format(data_path))

Dataset downloaded here: /content/datasets/scifact


In [6]:
from beir.datasets.data_loader import GenericDataLoader

data_path = "datasets/scifact"
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"

In [7]:
# corpus maps doc_id -> {"text": ..., "title": ...}; transpose so each row is one document.
pd_corpus = pd.DataFrame(corpus).transpose()
pd_corpus


Unnamed: 0,text,title
4983,Alterations of the architecture of cerebral white matter in the developing h...,Microstructural development of human newborn cerebral white matter assessed ...
5836,Myelodysplastic syndromes (MDS) are age-dependent stem cell malignancies tha...,Induction of myelodysplasia by myeloid-derived suppressor cells.
7912,ID elements are short interspersed elements (SINEs) found in high copy numbe...,"BC1 RNA, the transcript from a master gene for ID element amplification, is ..."
18670,DNA methylation plays an important role in biological processes in human hea...,The DNA Methylome of Human Peripheral Blood Mononuclear Cells
19238,Two human Golli (for gene expressed in the oligodendrocyte lineage)-MBP (for...,The human myelin basic protein gene is included within a 179-kilobase transc...
...,...,...
195689316,BACKGROUND The main associations of body-mass index (BMI) with overall and c...,Body-mass index and cause-specific mortality in 900 000 adults: collaborativ...
195689757,A key aberrant biological difference between tumor cells and normal differen...,Targeting metabolic remodeling in glioblastoma multiforme.
196664003,A signaling pathway transmits information from an upstream system to downstr...,Signaling architectures that transmit unidirectional information despite ret...
198133135,AIMS Trabecular bone score (TBS) is a surrogate indicator of bone microarchi...,"Association between pre-diabetes, type 2 diabetes and trabecular bone score:..."


In [8]:
# qrels maps query_id -> {doc_id: relevance}; columns are query IDs, rows are document IDs.
pd_qrels = pd.DataFrame(qrels)
pd_qrels

Unnamed: 0,1,3,5,13,36,42,48,49,50,51,...,1359,1362,1363,1368,1370,1379,1382,1385,1389,1395
31715818,1.0,,,,,,,,,,...,,,,,,,,,,
14717500,,1.0,,,,,,,,,...,,,,,,,,,,
13734012,,,1.0,,,,1.0,,,,...,,,,,,,,,,
1606628,,,,1.0,,,,,,,...,,,,,,,,,,
5152028,,,,,1.0,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2425364,,,,,,,,,,,...,,,,1.0,1.0,,,,,
17755060,,,,,,,,,,,...,,,,,,,1.0,,,
306006,,,,,,,,,,,...,,,,,,,,1.0,,
23895668,,,,,,,,,,,...,,,,,,,,,1.0,


In [9]:
# queries maps query_id -> query text, so build a one-column frame indexed by query ID.
pd_queries = pd.Series(queries, name="text").to_frame()
pd_queries

Unnamed: 0,text
1,0-dimensional biomaterials show inductive properties.
3,"1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants."
5,1/2000 in UK have abnormal PrP positivity.
13,5% of perinatal mortality is due to low birth weight.
36,A deficiency of vitamin B12 increases blood levels of homocysteine.
...,...
1379,Women with a higher birth weight are more likely to develop breast cancer later in life.
1382,aPKCz causes tumour enhancement by affecting glutamine metabolism.
1385,cSMAC formation enhances weak ligand signalling.
1389,mTORC2 regulates intracellular cysteine levels through xCT inhibition.


# Retriever


*   BaseGraphRetriever(BaseComponent)
*   BaseRetriever(BaseComponent)
*   BM25Retriever(BaseRetriever)
*   FilterRetriever(BM25Retriever)
*   TfidfRetriever(BaseRetriever)
*   DensePassageRetriever(BaseRetriever)
*   TableTextRetriever(BaseRetriever)
*   EmbeddingRetriever(BaseRetriever)
*   Text2SparqlRetriever(BaseGraphRetriever)

See documentation at: https://github.com/deepset-ai/haystack/blob/master/docs/_src/api/api/retriever.md





In [None]:
# from haystack.nodes import DensePassageRetriever

# retriever = DensePassageRetriever(
#     document_store=document_store,
#     query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
#     passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
#     max_seq_len_query=64,
#     max_seq_len_passage=256,
#     batch_size=16,
#     use_gpu=True,
#     embed_title=True,
#     use_fast_tokenizers=True,
# )

# results = retriever.retrieve(corpus, queries)

from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval import models
from beir.retrieval.search.dense import DenseRetrievalExactSearch 

#### Dense Retrieval using SBERT (Sentence-BERT) ####
#### Provide any pretrained sentence-transformers model
#### The model was fine-tuned using cosine-similarity.
#### Complete list - https://www.sbert.net/docs/pretrained_models.html

model = DenseRetrievalExactSearch(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

#### Retrieve dense results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)
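
Conceptually, exact dense search with `score_function="cos_sim"` embeds every query and document, scores each pair by cosine similarity, and keeps the highest-scoring documents per query. The sketch below illustrates that idea on hand-written toy vectors in plain Python. The function `exact_search` and the vectors are hypothetical; this is not BEIR's implementation, which obtains the embeddings from the SBERT model.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query_vecs, doc_vecs, top_k=2):
    """Score every query against every document and keep the top_k per query.

    Returns {query_id: {doc_id: score}}, the same nesting as qrels.
    """
    found = {}
    for query_id, q in query_vecs.items():
        scored = {doc_id: cos_sim(q, d) for doc_id, d in doc_vecs.items()}
        best = sorted(scored.items(), key=lambda item: item[1], reverse=True)[:top_k]
        found[query_id] = dict(best)
    return found


# Made-up 2-dimensional "embeddings" for three documents and one query.
toy_doc_vecs = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
toy_query_vecs = {"q1": [1.0, 0.1]}

toy_results = exact_search(toy_query_vecs, toy_doc_vecs, top_k=2)
print(toy_results)
```

Because every query is compared against every document, the search is exact but its cost grows with the product of the number of queries and the corpus size.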

In [22]:
logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
# evaluate() returns each metric for every cutoff in k_values, so one call is enough.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, _map, recall, precision)

{'NDCG@1': 0.42333, 'NDCG@3': 0.48416, 'NDCG@5': 0.51037, 'NDCG@10': 0.53789, 'NDCG@100': 0.57592, 'NDCG@1000': 0.59134} {'MAP@1': 0.39944, 'MAP@3': 0.45935, 'MAP@5': 0.47679, 'MAP@10': 0.48894, 'MAP@100': 0.49742, 'MAP@1000': 0.49797} {'Recall@1': 0.39944, 'Recall@3': 0.52561, 'Recall@5': 0.58872, 'Recall@10': 0.67233, 'Recall@100': 0.846, 'Recall@1000': 0.96833} {'P@1': 0.42333, 'P@3': 0.19333, 'P@5': 0.13333, 'P@10': 0.07567, 'P@100': 0.0096, 'P@1000': 0.0011}
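
To make the cutoff metrics concrete, the sketch below computes Precision@k and Recall@k on toy data in the same `{query_id: {doc_id: score}}` format. The helper `precision_recall_at_k` and the toy judgements are hypothetical. The numbers reported in this notebook come from `retriever.evaluate`, not from this simplified code.

```python
def precision_recall_at_k(qrels, results, k):
    """Average Precision@k and Recall@k over all judged queries."""
    precisions, recalls = [], []
    for query_id, judged in qrels.items():
        relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
        ranked = sorted(results.get(query_id, {}).items(),
                        key=lambda item: item[1], reverse=True)
        top_docs = [doc_id for doc_id, _ in ranked[:k]]
        hits = sum(1 for doc_id in top_docs if doc_id in relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant) if relevant else 0.0)
    n = len(qrels)
    return sum(precisions) / n, sum(recalls) / n


# Made-up relevance judgements and retrieval scores for two queries.
example_qrels = {"q1": {"d1": 1, "d4": 1}, "q2": {"d2": 1}}
example_results = {
    "q1": {"d1": 0.9, "d2": 0.5, "d3": 0.1},
    "q2": {"d3": 0.8, "d2": 0.7, "d1": 0.2},
}

for k in (1, 3):
    p, r = precision_recall_at_k(example_qrels, example_results, k)
    print("P@%d = %.3f, Recall@%d = %.3f" % (k, p, k, r))
```

Precision divides the hits by the cutoff `k`, while recall divides them by the number of relevant documents, which is why precision tends to fall and recall tends to rise as `k` grows.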

In [23]:
import random

#### Print top-k documents retrieved ####
top_k = 3

query_id, ranking_scores = random.choice(list(results.items()))
scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)
print("Query : %s\n" % queries[query_id])

for rank in range(top_k):
    doc_id = scores_sorted[rank][0]
    # Format: Rank x: ID [Title] Body
    print("Rank %d: %s [%s] - %s\n" % (rank+1, doc_id, corpus[doc_id].get("title"), corpus[doc_id].get("text")))

Query : Sildenafil improves erectile function in men who experience sexual dysfunction as a result of the use of SSRI antidepressants.

Rank 1: 39281140 [Treatment of antidepressant-associated sexual dysfunction with sildenafil: a randomized controlled trial.] - CONTEXT Sexual dysfunction is a common adverse effect of antidepressants that frequently results in treatment noncompliance. OBJECTIVE To assess the efficacy of sildenafil citrate in men with sexual dysfunction associated with the use of selective and nonselective serotonin reuptake inhibitor (SRI) antidepressants. DESIGN, SETTING, AND PATIENTS Prospective, parallel-group, randomized, double-blind, placebo-controlled trial conducted between November 1, 2000, and January 1, 2001, at 3 US university medical centers among 90 male outpatients (mean [SD] age, 45 [8] years) with major depression in remission and sexual dysfunction associated with SRI antidepressant treatment. INTERVENTION Patients were randomly assigned to take silde