<a href="https://colab.research.google.com/github/Huertas97/TREC_COVID_sentence_transformers/blob/main/notebooks/TREC_COVID_BM25_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook shows a full pipeline for evaluating a Transformer-based model on TREC-COVID round 1. 

Scores from the BM25 Okapi algorithm and from Sentence Transformers embeddings are computed separately, so the TREC-COVID metrics can also be computed for each strategy on its own.

In [1]:
!pip install -U -q sentence-transformers
!pip install -q scispacy
!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
!pip install -U -q tqdm
!pip install rank-bm25
# tqdm._instances.clear()
# from importlib import reload
# logging.shutdown()
# reload(logging)


# Clone the GitHub repository

In [2]:
!git clone https://github.com/Huertas97/TREC_COVID_sentence_transformers.git

Cloning into 'TREC_COVID_sentence_transformers'...
remote: Enumerating objects: 163, done.
remote: Counting objects: 100% (163/163), done.
remote: Compressing objects: 100% (145/145), done.
remote: Total 163 (delta 65), reused 38 (delta 16), pack-reused 0
Receiving objects: 100% (163/163), 10.57 MiB | 30.56 MiB/s, done.
Resolving deltas: 100% (65/65), done.


In [3]:
%cd TREC_COVID_sentence_transformers/

/content/TREC_COVID_sentence_transformers


# Download TREC-COVID data

In [4]:
# Script to download and preprocess CORD-19 documents for TREC-COVID task
!python ./scripts/build_trec_covid_data.py

2020-12-12 11:03:00.095375: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-12 11:03:02 - ------ Downloading docids-rnd1.txt ------
460kB [00:00, 1.07MB/s]
2020-12-12 11:03:03 - ------ Downloading topics-rnd1.xml ------
10.3kB [00:00, 6.80MB/s]       
2020-12-12 11:03:03 - ------ Downloading qrels-rnd1.txt ------
150kB [00:00, 782kB/s] 
2020-12-12 11:03:04 - ------ Downloading cord-19_2020-04-10.tar.gz ------
100% 1.52G/1.52G [01:40<00:00, 15.1MB/s]
2020-12-12 11:04:45 - ------ Uncompressing cord-19_2020-04-10.tar.gz ------
1.52GB [02:07, 11.9MB/s]                
2020-12-12 11:06:52 - ------ Uncompressing ./2020-04-10/noncomm_use_subset.tar.gz ------
76.4MB [00:03, 20.3MB/s]                
2020-12-12 11:06:56 - ------ Uncompressing ./2020-04-10/custom_license.tar.gz ------
660MB [01:22, 8.03MB/s]               
2020-12-12 11:08:18 - ------ Uncompressing ./2020-04-10/comm_use_subset.tar.gz ------
367MB [01:0

# BM25 Scores
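
As a rough reference for what BM25 Okapi computes, here is a minimal pure-Python sketch of the classic formula. It is not the repository's implementation (the notebook installs `rank-bm25` and scispacy for that). The whitespace tokenizer, the toy documents, the parameter values `k1=1.5` and `b=0.75`, and the non-negative `log(1 + ...)` IDF variant are all assumptions made for illustration, so absolute scores may differ from those of the script.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a BM25 Okapi-style formula."""
    # Naive whitespace tokenizer (assumption; a real pipeline would use a proper tokenizer)
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: in how many documents each term appears
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (illustrative only)
docs = [
    "coronavirus origin in bats",
    "weather effects on coronavirus transmission",
    "influenza vaccine trial",
]
scores = bm25_scores("coronavirus origin", docs)
print(scores)
```

The first document matches both query terms and should score highest; the last one matches none and scores zero.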

In [5]:
# Script to compute BM25 Okapi scaled scores
!python ./scripts/bm25_trec_covid.py -a -t -f  --data ./trec_covid_data/df_docs.pkl

-------- Loading scispacy en_core_sci_sm model --------
-------- Building corpus --------
-------- Extracting topics --------
Corpus:   0% 0/3 [00:00<?, ?it/s]-------- Adding fulltext corpus to BM25 --------
Tokenized: 100% 5553/5553 [01:25<00:00, 65.24it/s]
-------- fulltext: BM25 scores for each topic --------
Topic: 100% 30/30 [00:01<00:00, 15.76it/s]
Corpus:  33% 1/3 [01:29<02:59, 89.95s/it]-------- Adding abstract corpus to BM25 --------
Tokenized: 100% 5553/5553 [00:05<00:00, 952.16it/s] 
-------- abstract: BM25 scores for each topic --------
Topic: 100% 30/30 [00:01<00:00, 20.01it/s]
Corpus:  67% 2/3 [01:37<01:05, 65.33s/it]-------- Adding title corpus to BM25 --------
Tokenized: 100% 5553/5553 [00:00<00:00, 7511.97it/s]
-------- title: BM25 scores for each topic --------
Topic: 100% 30/30 [00:01<00:00, 26.14it/s]
Corpus: 100% 3/3 [01:39<00:00, 33.27s/it]
2020-12-12 11:11:38 - -------- Summation BM25 scores for all corpus --------
2020-12-12 11:11:38 - -------- Scaling scores (m

Next, we compute a ranking score for each document, with both BM25 and the sentence-embedding models, for every pairing of a topic field with a document field:

* query vs title
* query vs abstract
* query vs fulltext

<br>

* question vs title
* question vs abstract
* question vs fulltext

<br>

* narrative vs title
* narrative vs abstract
* narrative vs fulltext
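
The nine comparisons listed above are the Cartesian product of the three topic fields and the three document fields. A small sketch of that enumeration follows; the variable names are illustrative and are not taken from the repository's scripts. The document field is placed in the outer loop so that each corpus would only need to be processed once.

```python
from itertools import product

topic_fields = ["query", "question", "narrative"]
doc_fields = ["title", "abstract", "fulltext"]

# Outer loop over document fields, inner loop over topic fields
pairs = [(topic, doc) for doc, topic in product(doc_fields, topic_fields)]

for topic_field, doc_field in pairs:
    print(f"{topic_field} vs {doc_field}")
```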

Because embedding the corpus is the most computationally expensive step, we first compute the title embeddings and compare them against all the topic fields, then do the same for the abstracts, and finally for the full texts.

We also compute an abstract embedding by splitting the abstract into sentences and computing an embedding for each one. The abstract embedding is the average of its sentence embeddings.
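
The averaging step can be sketched in a few lines of plain Python. Note the assumptions: `toy_embed` is a hypothetical stand-in for a real sentence encoder, and the split on periods is a naive substitute for a proper sentence splitter. Only the mean-pooling logic is the point here.

```python
def toy_embed(sentence, dim=4):
    """Hypothetical stand-in for a sentence encoder (not a real model)."""
    vec = [0.0] * dim
    for i, ch in enumerate(sentence.lower()):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def mean_pool(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def embed_abstract(abstract, embed=toy_embed):
    """Embed each sentence separately, then average the sentence vectors."""
    # Naive split on periods (assumption)
    sentences = [s.strip() for s in abstract.split(".") if s.strip()]
    return mean_pool([embed(s) for s in sentences])

abstract = "Coronaviruses infect bats. Transmission to humans is studied. Results are discussed."
abstract_vec = embed_abstract(abstract)
print(abstract_vec)
```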

# Cosine similarity

Three scripts are available. Which one to use depends on the model you want to compute embeddings with (the embeddings are then used to calculate the cosine similarity scores).

* If you want to use a single model from [Hugging Face](https://huggingface.co/models) or [Sentence Transformers](https://www.sbert.net/index.html), use `cos_sim_trec_covid.py`.

* If you want to apply an ensemble of such models, use `ensemble_cos_sim_trec_covid.py`.

* Finally, an ensemble of models with PCA applied is available through the script `ensemble_dim_red_cos_sim_trec_covid.py`. Pre-computed PCAs are only available for the following multilingual models from Sentence Transformers:

  * distiluse-base-multilingual-cased
  * xlm-r-distilroberta-base-paraphrase-v1
  * xlm-r-bert-base-nli-stsb-mean-tokens
  * LaBSE
  * distilbert-multilingual-nli-stsb-quora-ranking
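
Whichever script is used, the score is a cosine similarity between a topic embedding and a document embedding. A minimal sketch of that measure and of ranking documents by it, using made-up three-dimensional vectors and hypothetical document ids rather than real model output:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy embeddings (illustrative values only)
topic_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}

# Rank documents from most to least similar to the topic
ranking = sorted(
    doc_vecs,
    key=lambda doc_id: cosine_similarity(topic_vec, doc_vecs[doc_id]),
    reverse=True,
)
print(ranking)
```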

## Hugging Face model

In [None]:
# Script to compute cosine similarity scores for clinicalcovid-bert-nli from Huggingface
!python ./scripts/cos_sim_trec_covid.py -b 1000 -t -a -f \
--data ./trec_covid_data/df_docs.pkl --model manueltonneau/clinicalcovid-bert-nli

## Ensemble

In [None]:
# Ensemble 5 models 
!python ./scripts/ensemble_cos_sim_trec_covid.py -b 1000 \
-t -a --data ./trec_covid_data/df_docs.pkl \
--model distiluse-base-multilingual-cased,xlm-r-distilroberta-base-paraphrase-v1,xlm-r-bert-base-nli-stsb-mean-tokens,LaBSE,distilbert-multilingual-nli-stsb-quora-ranking

## Ensemble and PCA

In [4]:
# Help documentation
!python ./scripts/ensemble_dim_red_cos_sim_trec_covid.py --help

2020-12-12 11:35:16.098580: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

Usage:

    python ensemble_dim_red_cos_sim_trec_covid.py [options] 

Options:
    -d, --data              Path to TREC-COVID parsed data
    -p, --pca               Path to dataframe with model names and PCAs
    -m, --model             Name of Transformer-based model from https://huggingface.co/pricing
    -f, --fulltext          Bool: Include fulltext corpus for BM25 scoring
    -a, --abstract          Bool: Include abstract corpus for BM25 scoring  
    -t, --title             Bool: Include titles corpus for BM25 scoring  
    -b, --batch             Batch size
    -h, --help              Help documentation

Example:
    python ./scripts/ensemble_dim_red_cos_sim_trec_covid.py -b 1000 -t -a --data ./trec_covid_data/df_docs.pkl --model distiluse-base-multilingual-cased,distilbert-multilingual-nli-stsb-quora-ranking


In [6]:
# Ensemble 5 models with PCA
!python ./scripts/ensemble_dim_red_cos_sim_trec_covid.py -b 1000 -t -a \
--data ./trec_covid_data/df_docs.pkl \
--pca ./PCA/df_multi_selected_99.pkl \
--model distiluse-base-multilingual-cased,xlm-r-distilroberta-base-paraphrase-v1,xlm-r-bert-base-nli-stsb-mean-tokens,LaBSE,distilbert-multilingual-nli-stsb-quora-ranking

2020-12-12 07:59:56.436379: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
-------- Loading scispacy en_core_sci_sm model --------
2020-12-12 08:00:04 - -------- Loading SentenceTransformer model --------
2020-12-12 08:00:04 - Load pretrained SentenceTransformer: distiluse-base-multilingual-cased
2020-12-12 08:00:04 - Did not find folder distiluse-base-multilingual-cased
2020-12-12 08:00:04 - Try to download model from server: https://sbert.net/models/distiluse-base-multilingual-cased.zip
2020-12-12 08:00:04 - Downloading sentence transformer model from https://sbert.net/models/distiluse-base-multilingual-cased.zip and saving it at /root/.cache/torch/sentence_transformers/sbert.net_models_distiluse-base-multilingual-cased
100% 504M/504M [00:17<00:00, 28.5MB/s]
2020-12-12 08:00:27 - Load SentenceTransformer from folder: /root/.cache/torch/sentence_transformers/sbert.net_models_distiluse-base-multilingual-cased
2020-1

In [7]:
# Ensemble 2 best multilingual models from STSb
!python ./scripts/ensemble_dim_red_cos_sim_trec_covid.py -b 1000 -t -a \
--data ./trec_covid_data/df_docs.pkl \
--pca ./PCA/df_multi_selected_99.pkl \
--model xlm-r-distilroberta-base-paraphrase-v1,xlm-r-bert-base-nli-stsb-mean-tokens

2020-12-12 08:33:29.989419: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
-------- Loading scispacy en_core_sci_sm model --------
2020-12-12 08:33:43 - -------- Loading SentenceTransformer model --------
2020-12-12 08:33:43 - Load pretrained SentenceTransformer: xlm-r-distilroberta-base-paraphrase-v1
2020-12-12 08:33:43 - Did not find folder xlm-r-distilroberta-base-paraphrase-v1
2020-12-12 08:33:43 - Try to download model from server: https://sbert.net/models/xlm-r-distilroberta-base-paraphrase-v1.zip
2020-12-12 08:33:43 - Load SentenceTransformer from folder: /root/.cache/torch/sentence_transformers/sbert.net_models_xlm-r-distilroberta-base-paraphrase-v1
2020-12-12 08:34:20 - Use pytorch device: cuda
2020-12-12 08:34:20 - Load pretrained SentenceTransformer: xlm-r-bert-base-nli-stsb-mean-tokens
2020-12-12 08:34:20 - Did not find folder xlm-r-bert-base-nli-stsb-mean-tokens
2020-12-12 08:34:20 - Try to download mod

In [6]:
# Ensemble of 3 best models in TREC-COVID
!python ./scripts/ensemble_dim_red_cos_sim_trec_covid.py -b 1000 -t -a \
--data ./trec_covid_data/df_docs.pkl \
--pca ./PCA/df_multi_selected_99.pkl \
--model distiluse-base-multilingual-cased,LaBSE,distilbert-multilingual-nli-stsb-quora-ranking

2020-12-12 11:11:40.660634: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
-------- Loading scispacy en_core_sci_sm model --------
2020-12-12 11:11:48 - -------- Loading SentenceTransformer model --------
2020-12-12 11:11:48 - Load pretrained SentenceTransformer: distiluse-base-multilingual-cased
2020-12-12 11:11:48 - Did not find folder distiluse-base-multilingual-cased
2020-12-12 11:11:48 - Try to download model from server: https://sbert.net/models/distiluse-base-multilingual-cased.zip
2020-12-12 11:11:48 - Downloading sentence transformer model from https://sbert.net/models/distiluse-base-multilingual-cased.zip and saving it at /root/.cache/torch/sentence_transformers/sbert.net_models_distiluse-base-multilingual-cased
100% 504M/504M [00:06<00:00, 75.4MB/s]
2020-12-12 11:12:00 - Load SentenceTransformer from folder: /root/.cache/torch/sentence_transformers/sbert.net_models_distiluse-base-multilingual-cased
2020-1

# Top k results

Remember to give the file created with the top-k results a meaningful output name (`-o` option). This file is then evaluated against the TREC-COVID expert judgments.
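
For orientation, `trec_eval` expects run files with one line per retrieved document in the form `topic Q0 docid rank score run_tag`. The sketch below shows how top-k lines in that format could be produced from per-topic scores. Summing the BM25 and cosine similarity scores is an assumption made for illustration, not necessarily what `topk_trec_covid.py` does, and the document ids, scores, and the `my_run` tag are made up.

```python
import heapq

def top_k_run_lines(scores_by_topic, k=1000, run_tag="my_run"):
    """Return TREC-format run lines: topic Q0 docid rank score run_tag."""
    lines = []
    for topic_id in sorted(scores_by_topic):
        doc_scores = scores_by_topic[topic_id]
        # Keep the k highest-scoring documents, best first
        best = heapq.nlargest(k, doc_scores.items(), key=lambda item: item[1])
        for rank, (doc_id, score) in enumerate(best, start=1):
            lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines

# Toy scaled scores for a single topic (illustrative values only)
bm25 = {1: {"d1": 0.9, "d2": 0.1, "d3": 0.4}}
cos_sim = {1: {"d1": 0.2, "d2": 0.8, "d3": 0.1}}

# Combine the two strategies by summation (assumption)
combined = {
    topic: {doc: bm25[topic][doc] + cos_sim[topic][doc] for doc in bm25[topic]}
    for topic in bm25
}

run_lines = top_k_run_lines(combined, k=2)
print("\n".join(run_lines))
```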

In [12]:
# Script to extract topk scores for each topic
!python ./scripts/topk_trec_covid.py --data ./trec_covid_data/df_docs.pkl \\
-p ./results -f df_BM25_sc.pkl,df_cos_sim_sc_distiluse-base-multilingual-cased_LaBSE_distilbert-multilingual-nli-stsb-quora-ranking.pkl \\
-o ensemble_3_models

2020-12-12 11:29:10 - -------- Retrieving top 1000 scores for each topic --------
2020-12-12 11:29:10 - -------- Saving results in ./results/ensemble_3_models --------
2020-12-12 11:29:10 - -------- Finished --------


# Evaluate TREC-COVID

Clone the official `trec_eval` repository and build it. In the evaluation commands, replace the `.txt` file name with that of the results file you want to score.

In [8]:
!git clone https://github.com/usnistgov/trec_eval.git
%cd trec_eval/
!make
%cd ..

Cloning into 'trec_eval'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 763 (delta 5), reused 3 (delta 0), pack-reused 749
Receiving objects: 100% (763/763), 679.52 KiB | 2.05 MiB/s, done.
Resolving deltas: 100% (491/491), done.
/content/TREC_COVID_sentence_transformers/trec_eval
gcc -g -I.  -Wall -DVERSIONID=\"9.0.7\"  -o trec_eval trec_eval.c formats.c meas_init.c meas_acc.c meas_avg.c meas_print_single.c meas_print_final.c get_qrels.c get_trec_results.c get_prefs.c get_qrels_prefs.c get_qrels_jg.c form_res_rels.c form_res_rels_jg.c form_prefs_counts.c utility_pool.c get_zscores.c convert_zscores.c measures.c  m_map.c m_P.c m_num_q.c m_num_ret.c m_num_rel.c m_num_rel_ret.c m_gm_map.c m_Rprec.c m_recip_rank.c m_bpref.c m_iprec_at_recall.c m_recall.c m_Rprec_mult.c m_utility.c m_11pt_avg.c m_ndcg.c m_ndcg_cut.c m_Rndcg.c m_ndcg_rel.c m_binG.c m_G.c m_rel_P.c m_success.c m

All official metrics.
Remember to replace the `.txt` file name with the one you want to evaluate.
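
To make two of the reported measures concrete, here is a small sketch of precision at k (`P_k`) and average precision (averaged over topics, this gives `map`) for a single topic with binary relevance. The ranking and the relevant set are toy data, and `trec_eval` remains the reference implementation.

```python
def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k

def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy ranking and judgments (illustrative only)
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d3", "d2", "d5"}

p5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
print(p5, ap)
```

Here two of the five retrieved documents are relevant, and the relevant ones sit at ranks 1 and 4, while a third relevant document is never retrieved.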

In [13]:
!./trec_eval/trec_eval ./trec_covid_data/qrels-rnd1.txt ./results/ensemble_3_models.txt

runid                 	all	TFM
num_q                 	all	30
num_ret               	all	30000
num_rel               	all	2352
num_rel_ret           	all	1160
map                   	all	0.1893
gm_map                	all	0.1307
Rprec                 	all	0.2497
bpref                 	all	0.3588
recip_rank            	all	0.8378
iprec_at_recall_0.00  	all	0.8601
iprec_at_recall_0.10  	all	0.5361
iprec_at_recall_0.20  	all	0.3905
iprec_at_recall_0.30  	all	0.2651
iprec_at_recall_0.40  	all	0.1730
iprec_at_recall_0.50  	all	0.1067
iprec_at_recall_0.60  	all	0.0427
iprec_at_recall_0.70  	all	0.0182
iprec_at_recall_0.80  	all	0.0031
iprec_at_recall_0.90  	all	0.0000
iprec_at_recall_1.00  	all	0.0000
P_5                   	all	0.6333
P_10                  	all	0.5367
P_15                  	all	0.4778
P_20                  	all	0.4450
P_30                  	all	0.3822
P_100                 	all	0.2050
P_200                 	all	0.1300
P_500                 	all	0.0649
P_1000                	all

nDCG metrics
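
As a rough guide to what `ndcg_cut_k` measures, the sketch below computes nDCG at a cutoff for one topic, using the relevance grade as the gain and a `log2(rank + 1)` discount. The judgments and the ranking are toy data with hypothetical document ids, and `trec_eval` may differ in details such as tie handling.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))

def ndcg_at_k(ranked_docs, qrels, k):
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ranking."""
    gains = [qrels.get(doc, 0) for doc in ranked_docs[:k]]
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy graded judgments (0 = not relevant, 1 = partially relevant, 2 = relevant)
qrels = {"d1": 2, "d2": 1, "d3": 0, "d4": 2}
ranked = ["d1", "d3", "d2"]

score = ndcg_at_k(ranked, qrels, 3)
print(round(score, 4))
```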

In [14]:
!./trec_eval/trec_eval -m ndcg_cut ./trec_covid_data/qrels-rnd1.txt ./results/ensemble_3_models.txt

ndcg_cut_5            	all	0.6059
ndcg_cut_10           	all	0.5297
ndcg_cut_15           	all	0.4820
ndcg_cut_20           	all	0.4542
ndcg_cut_30           	all	0.4096
ndcg_cut_100          	all	0.3538
ndcg_cut_200          	all	0.3820
ndcg_cut_500          	all	0.4207
ndcg_cut_1000         	all	0.4547
