**Wide perspective query system focused on COVID-19**

A system for comprehensive querying of scientific papers on COVID-19, combining search at document and passage level with a supporting visualization of the results.

The system has the following features:
* Simultaneous and complementary retrieval of documents (coarse grain) and passages (fine grain) relevant to queries.
* Visual representation of relevant documents and paragraphs according to their semantic content.
* Hybrid retrieval of paragraphs and answers.

Techniques:
* Document retrieval with language models (Indri).
* Passage retrieval combining language models (Indri) with re-ranking by a fine-tuned BERT model.
* Visualization based on embeddings and dimensionality reduction.
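
The passage retrieval technique listed above is a two-stage process: a first-stage ranker proposes candidates and a second model re-scores them. The following is a minimal sketch of that idea, not the system's actual code. The names `rerank`, `Candidate` and `toy_overlap_score` are hypothetical, the first-stage scores are made-up values standing in for Indri scores, and the toy overlap function stands in for the fine-tuned BERT classifier.

```python
from typing import Callable, List, Tuple

# (passage_id, passage_text, first_stage_score); the scores are illustrative values
Candidate = Tuple[str, str, float]


def rerank(query: str,
           candidates: List[Candidate],
           score_pair: Callable[[str, str], float],
           top_k: int = 100) -> List[Tuple[str, float]]:
    """Keep the top_k candidates by first-stage score, then sort them by score_pair."""
    pool = sorted(candidates, key=lambda c: c[2], reverse=True)[:top_k]
    rescored = [(pid, score_pair(query, text)) for pid, text, _ in pool]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in for a neural re-ranker: fraction of query terms found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    ("p1", "coronavirus genome sequencing methods", -4.1),
    ("p2", "incubation period of the novel coronavirus", -4.5),
    ("p3", "seasonal influenza vaccination coverage", -6.0),
]
ranking = rerank("coronavirus incubation period", candidates, toy_overlap_score, top_k=2)
print(ranking)
```

In a real setting `score_pair` would return the probability that the fine-tuned classifier assigns to the "relevant" label for the (query, passage) pair.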

Contributions:
* Results and visualizations produced by different techniques, offering an enriched, wide-perspective way of querying the literature.
* Fine-tuning on a training set built from titles and abstracts.



**Preprocessing of the collections**

Documents are filtered by keyword, keeping those that have a JSON file containing their paragraphs.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
import sys
import re
import collections

import csv

# Generate regex patterns for the filter words. For the moment this is a simple list of terms; it ought to be more sophisticated.
filterwords=["2019-nCoV","COVID-19","novel coronavirus","SARS-CoV-2","Wuhan coronavirus"]
word_patterns=collections.OrderedDict()
for w in filterwords:
    # Patterns are built from lowercased terms, so they must be matched against lowercased text.
    wrdlc=w.lower()
    wrdptrn=re.compile(r"\b"+re.escape(wrdlc))
    word_patterns[wrdlc]=wrdptrn


output=[]
processed={}
# csv.DictReader needs an open file object, not a path string.
mf=open('kaggle/input/metadata.csv', newline='', encoding='utf-8')
metadata=csv.DictReader(mf, dialect='excel')

# NOTE: out_file was not defined in the original cell; this file name is a placeholder.
out_file="metadata_filtered.csv"
of=open(out_file,"w", newline='', encoding='utf-8')
# Copy the field list so the reader's own fieldnames are not modified.
fieldnames=list(metadata.fieldnames)
fieldnames.append('keywords_found')
sys.stderr.write("fields: {}\n".format(fieldnames))

wr=csv.DictWriter(of,fieldnames=fieldnames, dialect='excel')
wr.writeheader()

proces_count=0
skipped=0
docs_found=0
file_problems=0
for row in metadata:
    proces_count+=1
    #if proces_count > 10:
    #    sys.exit(100)
    sys.stderr.write("\r {a:8d} documents processed".format(a=proces_count))
    #sys.stderr.write("\n document sha {} and pmcid {} --> \n row {}\n".format(row["sha"],row["pmcid"],row))
    # We give preference to sha over pmcid.
    file_id=row["sha"]
    file_type="pdf_json"
    if not row["sha"]:
        file_id=row["pmcid"]
        file_type="pmc_json"
        #sys.stderr.write("WARN: document {} has no sha {}\n".format(row["cord_uid"],row["sha"]))

        if not row["pmcid"]:
            skipped+=1
            sys.stderr.write("WARN: document {} has neither sha nor pmcid, skipping ({})\n".format(row["cord_uid"],skipped))
            continue

    if file_id in processed:
        sys.stderr.write("WARN: document with file_id {} (sha or pmcid) already processed, skipping\n".format(file_id))
        continue
    processed[file_id]=1
    # Register the pmcid too, so the same paper is not processed again through its other identifier.
    if row["sha"] and row["pmcid"]:
        processed[row["pmcid"]]=1

    extension=".json"


# Walk over the collection to find relevant documents (for now this only lists the files).
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.
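
The cell above compiles the keyword patterns and adds a `keywords_found` column, but the matching step itself is not shown. A minimal sketch of how the patterns could be applied to a paragraph follows. The helper name `find_keywords` is hypothetical, and how the original pipeline stores its matches is not known.

```python
import collections
import re

filterwords = ["2019-nCoV", "COVID-19", "novel coronavirus", "SARS-CoV-2", "Wuhan coronavirus"]

# Same construction as in the preprocessing cell: lowercased term -> compiled pattern
word_patterns = collections.OrderedDict(
    (w.lower(), re.compile(r"\b" + re.escape(w.lower()))) for w in filterwords
)


def find_keywords(text):
    """Return the filter words (lowercased) that occur in text, in filterwords order."""
    lowered = text.lower()  # patterns are lowercased, so the text must be too
    return [kw for kw, ptrn in word_patterns.items() if ptrn.search(lowered)]


found = find_keywords("We sequenced SARS-CoV-2 isolates from COVID-19 patients.")
print(found)
```

A document would then be kept when `find_keywords` returns a non-empty list for its title, abstract or any of its paragraphs; which of these fields to check is a design choice left open here.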

Document collection in TREC format.




Paragraph/passage collection in TREC format.
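
Indri can index text in the TREC text format, where each unit is wrapped in `<DOC>` with a `<DOCNO>` identifier and a `<TEXT>` body. A minimal sketch of writing both collections in that format is given below. The function names and the `<doc_id>_<index>` passage identifier scheme are assumptions made for illustration, not the identifiers used by the original system.

```python
def to_trec(doc_id, text):
    """Format one retrieval unit as a TREC text record."""
    body = " ".join(text.split())  # collapse newlines and repeated spaces
    return "<DOC>\n<DOCNO>{}</DOCNO>\n<TEXT>\n{}\n</TEXT>\n</DOC>\n".format(doc_id, body)


def document_to_trec(doc_id, paragraphs):
    """Coarse grain: the whole document becomes one record."""
    return to_trec(doc_id, " ".join(paragraphs))


def passages_to_trec(doc_id, paragraphs):
    """Fine grain: each paragraph becomes its own record (hypothetical id scheme)."""
    return "".join(
        to_trec("{}_{}".format(doc_id, i), p) for i, p in enumerate(paragraphs)
    )


paragraphs = ["First paragraph of the paper.", "Second paragraph\nwith a line break."]
print(document_to_trec("doc1", paragraphs))
print(passages_to_trec("doc1", paragraphs))
```

Keeping the document identifier as a prefix of each passage identifier makes it easy to map a retrieved passage back to its source document.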



Indexing document collection
Indexing paragraph/passage collection
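
Indri indexes are built with `IndriBuildIndex`, which reads an XML parameter file. The sketch below generates one such file per collection. All paths, the memory setting and the choice of the Krovetz stemmer are assumptions for illustration; the settings actually used for this system are not given.

```python
from xml.sax.saxutils import escape


def indri_build_params(index_dir, corpus_path, memory="2G", stemmer="krovetz"):
    """Return the text of an IndriBuildIndex parameter file for a trectext corpus."""
    return (
        "<parameters>\n"
        "  <index>{index}</index>\n"
        "  <memory>{memory}</memory>\n"
        "  <corpus>\n"
        "    <path>{corpus}</path>\n"
        "    <class>trectext</class>\n"
        "  </corpus>\n"
        "  <stemmer><name>{stemmer}</name></stemmer>\n"
        "</parameters>\n"
    ).format(index=escape(index_dir), memory=escape(memory),
             corpus=escape(corpus_path), stemmer=escape(stemmer))


# Hypothetical paths: one index for documents, one for passages
doc_params = indri_build_params("indexes/docs", "trec/docs.trec")
psg_params = indri_build_params("indexes/passages", "trec/passages.trec")
print(doc_params)
```

Each string would be written to a file and passed as the argument to `IndriBuildIndex`, giving two separate indexes that can be queried independently.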

**Re-ranking by means of fine-tuned BERT for sentence pair classification**

Clinical BERT (Bio+ClinicalBERT; Alsentzer et al., 2019) fine-tuned for sentence pair classification on two datasets:
- Titles as questions and abstracts as answers, from the CORD-19 Kaggle dataset.
- The MedQuAD question answering dataset (Ben Abacha and Demner-Fushman, 2019).




The first step is to prepare the dataset for fine-tuning the BERT models. The title-abstract collection is straightforward: we only need to extract the title and abstract pairs and write them as tab-separated lines that the BERT fine-tuning script can read.

In [None]:
import csv
import random


titles={}
absts={}


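with open('/data/input/metadata.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=',',quotechar='"')
    # Skip the header row
    next(reader)
    for row in reader:
        dokid=row[0]
        title=row[3]
        abstract=row[8]
        if abstract.strip() != "" and title.strip() != "":
            # Collapse tabs and newlines so that each example stays on one TSV line
            titles[dokid]=" ".join(title.split())
            absts[dokid]=" ".join(abstract.split())


# Build the list of candidate ids once instead of on every draw
dokids=list(absts.keys())
for dokid in titles:
    # positive example: a title paired with its own abstract
    print(titles[dokid]+"\t"+dokid+"\t"+absts[dokid]+"\t"+dokid+"\t1")
    # negative examples (1:10 positive:negative ratio)
    for i in range(10):
        dokid2=random.choice(dokids)
        # never use the document's own abstract as a negative example
        while dokid2 == dokid and len(dokids) > 1:
            dokid2=random.choice(dokids)
        print(titles[dokid]+"\t"+dokid+"\t"+absts[dokid2]+"\t"+dokid2+"\t0")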


For fine-tuning we use the [run_classifier.py](https://github.com/google-research/bert/blob/master/run_classifier.py) script from the original BERT distribution, with a custom data processor very similar to the one used for the MRPC dataset (only minimal changes were made to the _create_examples function to adapt it to our needs).

In [None]:
class CovidKaggleProcessor(DataProcessor):
  """..."""

  def get_labels(self):
    """See base class."""
    return ["0", "1"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      text_a = tokenization.convert_to_unicode(line[0])
      text_b = tokenization.convert_to_unicode(line[2])
      # The label is read from column 4 for every split, including "test".
      label = tokenization.convert_to_unicode(line[4])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples


Fine-tuning is done over Bio+ClinicalBERT for 4 epochs, with a batch size of 16 so that it fits on our GPU (GeForce RTX 2080 Ti). The exact command used was:

In [None]:
BERT_BASE_DIR="clinicalBert/pretrained_bert_tf/biobert_pretrain_output_all_notes_150000"

python -u run_classifier.py  --task_name=covid \
       --do_train=true \
       --do_eval=true \
       --data_dir=$GLUE_DIR \
       --vocab_file=$BERT_BASE_DIR/vocab.txt \
       --bert_config_file=$BERT_BASE_DIR/bert_config.json \
       --init_checkpoint=$BERT_BASE_DIR/model.ckpt-150000 \
       --max_seq_length=128 \
       --train_batch_size=16 \
       --learning_rate=2e-5 \
       --num_train_epochs=4.0 \
       --output_dir=$GLUE_DIR/output-4e-1 \
       --do_lower_case=False

Results of fine-tuning for re-ranking:


------------ Result of Bert fine-tuned model ----------

              precision    recall  f1-score   support

           0     0.9962    0.9976    0.9969     77791
           1     0.9761    0.9620    0.9690      7782

    accuracy                         0.9944     85573
   macro avg     0.9862    0.9798    0.9830     85573
weighted avg     0.9944    0.9944    0.9944     85573
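
The figures in the report can be cross-checked with a few lines of arithmetic. The sketch below recomputes the per-class F1 scores and two of the averages from the precision, recall and support values in the table. Since it starts from values rounded to four decimals, small differences in the last digit would not be surprising in general.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per class, copied from the report above
report = {"0": (0.9962, 0.9976, 77791), "1": (0.9761, 0.9620, 7782)}

f1_scores = {label: f1(p, r) for label, (p, r, _) in report.items()}
total = sum(s for _, _, s in report.values())

# Weighted average: each class contributes in proportion to its support
weighted_precision = sum(p * s for p, _, s in report.values()) / total
# Macro average: every class counts equally, regardless of support
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

for label in report:
    print(label, round(f1_scores[label], 4))
print("weighted precision", round(weighted_precision, 4))
print("macro recall", round(macro_recall, 4))
```

The support values (7782 positive against 77791 negative pairs) reflect the roughly 1:10 positive to negative ratio used when building the examples, which is why the weighted averages sit close to the scores of class 0.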
