# Scientific document retrieval system for COVID-19 using the COVID-19 Open Research Dataset (CORD-19)

# Download the dataset

## About the dataset

The [dataset](https://github.com/allenai/cord19), released by the Allen Institute for AI, is a corpus of academic papers about COVID-19 and related coronavirus research.
The accompanying paper (Lucy Lu Wang et al., 2020) explains the methodology: [CORD-19: The COVID-19 Open Research Dataset](https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1/).


## Script to download and extract the dataset

In [None]:
%%capture
import nltk
nltk.download("popular")
import os
import csv
import json
from collections import defaultdict

In [None]:
# download and extract
def download_CORD19_dataset(version, url="https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_{version}.tar.gz"):
    url = url.format(version=version)
    ziped_file_name = "cord-19_"+version+".tar.gz"
    file_name = version

    # download dataset
    if not os.path.exists(ziped_file_name):
        print("Downloading: "+url)
        import urllib.request
        urllib.request.urlretrieve(url, ziped_file_name)
    else:
        print("Dataset already downloaded")

    # extract dataset
    if not os.path.exists(file_name):
        print("Unzipping dataset file")
        import tarfile
        tar = tarfile.open(ziped_file_name)
        tar.extractall()
        tar.close()
    else:
        print("Dataset already unzipped")

    # extract json files
    if not os.path.exists(file_name+"/document_parses"):
        print("Unzipping dataset document parses")
        import tarfile
        tar = tarfile.open(file_name+"/document_parses"+".tar.gz")
        tar.extractall(path="./"+file_name)
        tar.close()
    else:
        print("Document parses already unzipped")


## PyTorch dataset class

We will create a dataset class that reads only the files of the documents we request, instead of opening every file, because loading this enormous dataset at once would exhaust memory.

The method for opening the CSV file and extracting the main body is described [here](https://github.com/allenai/cord19).

Indexing the dataset returns the part of the document that the user asks for.

For example, if we set the part of the document to "title", then `dataset["title"][3]` returns the title of the document at index 3 (the 4th document) as a list containing one string.

Regarding the selection of scientific documents, we keep only papers published after December 2019, when the new virus was identified.

We assume that papers in the corpus published since then are related to COVID-19.
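The selection rule described above can be sketched on a toy table. The rows below are invented for illustration, and the sketch assumes that `publish_time` is stored as ISO-formatted strings (`YYYY-MM-DD`), which sort correctly when compared as plain text.

```python
import pandas as pd

# Toy stand-in for metadata.csv; these rows are invented for illustration.
metadata = pd.DataFrame({
    "cord_uid": ["a1", "b2", "c3", "d4"],
    "title": ["Old SARS study", "Early COVID-19 report", None, "Vaccine trial"],
    "publish_time": ["2015-06-01", "2020-02-10", "2020-03-01", "2020-05-20"],
    "pdf_json_files": ["x.json", "y.json", "z.json", None],
})

# Keep rows published after the cutoff that also have a title and a parsed PDF.
selected = metadata[
    (metadata["publish_time"] > "2019-12-30")
    & metadata["pdf_json_files"].notnull()
    & metadata["title"].notnull()
]
selected_ids = selected["cord_uid"].tolist()
print(selected_ids)
```

Only the row that passes all three conditions survives; rows with a missing title or a missing parsed PDF are dropped even when their date qualifies.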

In [None]:
import pandas as pd
pd.options.display.max_colwidth = 10000
import torch
import torch.nn as nn
import numpy as np
import time
from torch.utils.data import DataLoader

class Document_Dataset(torch.utils.data.Dataset):
    def __init__(self, dataset):
        # we need a list of sentences
        if isinstance(dataset, list): 
            self.dataset = dataset
        else:
            self.dataset = [dataset]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):        
        return self.dataset[index]

class CORD_Dataset_(torch.utils.data.Dataset):
    def __init__(self, dataset_path, limit=-1):
        # dataset path
        self.path = dataset_path
        # a limit of elements to read
        self.limit = limit
        self.part_of_doc = 0

        # open the csv file
        df = pd.read_csv(self.path+"/metadata.csv")

        """
        select documents after December of 2019
        and also have a valid pdf json body text
        """
        self.df = df[(df['publish_time'] > '2019-12-30') & (df['pdf_json_files'].notnull()) & (df['title'].notnull()) ]

        # if limit = -1 (or larger than the dataset) use the true size of the dataset
        if limit == -1 or limit > len(self.df):
            self.limit = len(self.df)
        # get the keys of the selected documents
        self.keys = self.df["cord_uid"].tolist()[0:self.limit]


    def __len__(self):
        # len is the limit 
        return self.limit

    def __getitem__(self, index):  

        row = self.df.iloc[[index]]

        if self.part_of_doc == 'title':
            return Document_Dataset( [ row['title'].to_string(index=False) ] )

        elif self.part_of_doc == 'authors':
            return Document_Dataset( row['authors'].to_string(index=False).split('; ') )

        elif self.part_of_doc == 'abstract':
            # tokenize sentences of abstract
            return Document_Dataset( nltk.sent_tokenize(row['abstract'].to_string(index=False)) )

        elif self.part_of_doc == 'main_body':
            # access the full text (if available)
            main_body = []
            if not row['pdf_json_files'].empty:
                for json_path in row['pdf_json_files'].to_string(index=False).lstrip().split('; '):
                    # print(json_path)
                    with open(self.path+"/"+json_path) as f_json:
                        full_text_dict = json.load(f_json)
                        
                        # grab main_body text section from *some* version of the full text
                        # and tokenize to sentences
                        for paragraph_dict in full_text_dict['body_text']:
                            paragraph_text = paragraph_dict['text']
                            # tokenize sentences with nltk
                            sentence_text = nltk.sent_tokenize(paragraph_text) 
                            main_body.extend(sentence_text)

                        # stop searching other copies of full text if already got main_body
                        if main_body:
                            break

            return Document_Dataset( main_body )

class CORD_Dataset(torch.utils.data.Dataset):
    def __init__(self, dataset_path, limit=-1):

        # instantiate a dataset 
        self.dataset = CORD_Dataset_(dataset_path, limit)
        self.limit = self.dataset.limit
        self.keys = self.dataset.keys

    def __len__(self):
        # 4 parts of doc we have title, authors, abstract, main body
        return 4

    def __getitem__(self, part_of_doc):  
        # set the part of doc for the CORD dataset
        self.dataset.part_of_doc = part_of_doc    
        return self.dataset


## Get the dataset class


In [None]:
version = "2020-05-26"
download_CORD19_dataset(version)
dataset = CORD_Dataset(version, 5000)
keys = dataset.keys

Downloading: https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2020-05-26.tar.gz
Unzipping dataset file
Unzipping dataset document parses




## Functions to compute and save the embeddings

We want to feed all the sentences of our dataset to the model and extract the outputs. Because the dataset is very large and Google Colab may interrupt the runtime, we save the embeddings of each document to its own file as soon as they are computed.
This process runs 3 times: once for the title, once for the abstract, and once for the main body. The 3 resulting files are stored in the same directory, which is unique for every document.


First, we create a list of datasets that all hold the same part of the document (title, abstract, or main body).

Then we pass this list to a function that computes the embeddings for every dataset and saves the tensor to the given directory.
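The save-and-resume idea can be sketched in isolation as follows. This is a minimal illustration, not the notebook's own function: `save_if_missing` and `compute_fn` are hypothetical names, and a tensor of zeros stands in for a real model call.

```python
import os
import tempfile

import torch

def save_if_missing(directory, key, part_of_doc, compute_fn):
    """Save compute_fn() to directory/key/part_of_doc.pt unless the file exists.

    Returns True if the embeddings were computed, False if the step was skipped.
    (Hypothetical helper, written only for this sketch.)
    """
    file_path = os.path.join(directory, key, part_of_doc + ".pt")
    if os.path.exists(file_path):
        return False
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    torch.save(compute_fn(), file_path)
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    fake_embeddings = lambda: torch.zeros(2, 768)  # stand-in for a model call
    first_run = save_if_missing(tmp_dir, "doc_0", "title", fake_embeddings)
    second_run = save_if_missing(tmp_dir, "doc_0", "title", fake_embeddings)
    loaded = torch.load(os.path.join(tmp_dir, "doc_0", "title.pt"))
    loaded_shape = tuple(loaded.shape)
```

Because each document and part has its own file, an interrupted run can be restarted and will skip everything that was already written.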

In [None]:

def save_embeddings(embeddings, directory, key, part_of_doc):
    path = directory+"/"+key
    file_name = part_of_doc+".pt"
    # if the directory doesn't exist, make one
    if not os.path.exists(path):
        os.makedirs(path)
    # save torch array
    torch.save(embeddings, os.path.join(path, file_name))



def get_save_embeddings(model, tokenizer, dataset, keys, directory, part_of_doc):

    # get and save the embeddings for every document in the dataset
    for i in range(len(dataset)):

        # if the embeddings are already computed then don't compute them again
        if not os.path.exists(directory+"/"+keys[i]+"/"+part_of_doc+".pt"):
            # build the dataloader only when needed, so finished documents are not read from disk again
            dataloader = DataLoader(dataset[i], batch_size=4, shuffle=False, pin_memory=True, num_workers=4, drop_last=False, collate_fn=tokenizer)
            print(keys[i], i)
            t = time.time()
            embeddings = get_embeddings(model, dataloader)
            e = time.time()
            save_embeddings(embeddings, directory, keys[i], part_of_doc)
            s = time.time()
            # time spent computing the embeddings, time spent saving them
            print(e-t, s-e)
    



# 1st method pre-trained BERT embeddings

For our first method, we will feed our sentences to a BERT model [(Devlin et al., 2018)](https://arxiv.org/abs/1810.04805) and use the averaged output as sentence embeddings.

## Why BERT

BERT is a method of pre-training language representations.
It is a general-purpose "language understanding" model that can be used on tasks like question answering. According to its authors, BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

Because the pre-training is unsupervised, the model can be trained on enormous plain-text corpora from the internet, such as Wikipedia.

Pre-trained representations can be either context-free or contextual, and contextual representations can be unidirectional or bidirectional.
Context-free embeddings such as GloVe, which we also use below, assign one vector to each word regardless of the sentence it appears in, so they do not provide the context that we need to produce meaningful embeddings.


BERT builds on recent work such as ELMo and ULMFiT.
Those models are contextual, but they are either unidirectional or only shallowly bidirectional. The great advantage of BERT over these methods is that it uses transformers. Transformers combine multi-head attention with feed-forward blocks. Attention takes the inner product of the representation of each word with that of every other word in the sentence and uses the resulting weights to mix their representations, so each word's embedding reflects its context. This approach achieved state-of-the-art results on many NLP benchmarks when it was published and produces better contextual embeddings.
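The word-to-word inner product described above can be illustrated with a single attention head. This is a simplified sketch: a real transformer layer first applies learned query, key, and value projections and uses several heads, which are omitted here, and the token vectors are random.

```python
import math

import torch

def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    dim = queries.size(-1)
    # inner product of every token with every other token, scaled by sqrt(d)
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(dim)
    # each row becomes a probability distribution over the tokens
    weights = torch.softmax(scores, dim=-1)
    # every output vector is a weighted mix of all the value vectors
    return weights @ values, weights

torch.manual_seed(0)
tokens = torch.randn(5, 8)  # 5 toy token vectors of size 8 (random, for illustration)
contextual, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, so every output vector in `contextual` is a convex combination of the token vectors in the sentence.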




### About the implementation


We will use the base pre-trained BERT model, which has a hidden size of 768. We take the average of the outputs for all tokens, so that the final output is a 768-dimensional embedding for each sentence. We use the uncased model, which is generally the better choice unless case information matters for the task (e.g., Named Entity Recognition or Part-of-Speech tagging).
Downloading the pre-trained model and tokenizing the words is handled by the [huggingface library](https://huggingface.co/transformers/index.html).
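One detail of averaging is worth illustrating. When a batch is padded to a common length, a plain mean over the token axis also averages the padding positions. The sketch below shows an alternative that uses the attention mask to average real tokens only; it is not the pooling used in this notebook, and `masked_mean_pool` is a hypothetical helper name.

```python
import torch

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real tokens only (mask value 1)."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token positions, hidden size 2.
# The last position of the first sentence is padding.
token_embeddings = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
                                 [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]]])
attention_mask = torch.tensor([[1, 1, 0],
                               [1, 1, 1]])
pooled = masked_mean_pool(token_embeddings, attention_mask)
plain = token_embeddings.mean(dim=1)  # plain mean, padding included
```

For the first sentence the two results differ, because the plain mean includes the padding vector; for the second sentence, which has no padding, they agree.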


## Install the [huggingface library](https://huggingface.co/transformers/index.html) and download the base model

In [None]:
%%capture
!pip install transformers
import torch
import torch.nn as nn
import numpy as np
import time
import sys
import transformers
from transformers import BertTokenizer, BertModel, BertTokenizerFast
from torch.utils.data import DataLoader

#### Checking for gpu *usage*

In [None]:
# gpu for pytorch
device = torch.device('cpu')
if torch.cuda.is_available():
    device = torch.device('cuda')
print(device)

cuda


### Get the pre-trained BERT model  

In [None]:
def get_embeddings(model, dataloader):
    # create an empty array to store the outputs of the model
    # its size must be the size of the dataset behind the dataloader
    data_array = torch.empty((len(dataloader.dataset), model.output_dim))
    batch_size = dataloader.batch_size
    for batch_idx, data in enumerate(dataloader):
        # Load the data to gpu
        # t = time.time()
        data["input_ids"] = data["input_ids"].to(device)
        data["attention_mask"] = data["attention_mask"].to(device)
        data["token_type_ids"] = data["token_type_ids"].to(device)
        model_output = model(data).cpu()

        # append to the data array
        data_array[batch_idx*batch_size : (batch_idx*batch_size) + len(model_output)] = model_output 
        # print(time.time() - t)
    # return the results from all the batches
    return  data_array

class BERT_Model(torch.nn.Module):
    def __init__(self, pre_trained, tokenizer):

        super(BERT_Model, self).__init__()
        self.bert = pre_trained
        self.tokenizer = tokenizer
        # base bert output
        self.output_dim = self.bert.config.hidden_size

    def forward(self, x):
        """
        we average the last output layer of the pretrained bert model
        """
        model_output = self.bert(**x).last_hidden_state.detach()
        model_output = torch.mean(model_output, dim=1)
        return model_output

    def encode(self, sentences):
        """
        encode the given sentences
        we do not use that during training in order to use the parallelization provided by the torch dataloader
        """
        tokenizes_sentences = self.tokenizer(sentences)
        tokenizes_sentences = tokenizes_sentences.to(device)
        return self.forward(tokenizes_sentences)

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

def BERT_tokenizer(batch):
    return bert_tokenizer(batch, return_tensors="pt", padding=True, truncation=True)

BERT_model = BERT_Model(bert_model, BERT_tokenizer)
BERT_model.to(device)

## Compute all the BERT embeddings

### Title

In [None]:
BERT_directory = "/content/drive/MyDrive/BERT_embeddings"

part_of_doc = "title"

get_save_embeddings(BERT_model, BERT_tokenizer,  dataset[part_of_doc], keys, BERT_directory, part_of_doc)

### Abstract

In [None]:
part_of_doc = "abstract"

get_save_embeddings(BERT_model, BERT_tokenizer,  dataset[part_of_doc], keys, BERT_directory, part_of_doc)

### Main body

In [None]:
part_of_doc = "main_body"

get_save_embeddings(BERT_model, BERT_tokenizer,  dataset[part_of_doc], keys, BERT_directory, part_of_doc)

### Zip embeddings

In [None]:
%%capture
!zip -r "/content/drive/MyDrive/BERT_embeddings.zip" "/content/drive/MyDrive/BERT_embeddings"

In [None]:
!du -sh  /content/drive/MyDrive/BERT_embeddings

2.4G	/content/drive/MyDrive/BERT_embeddings


In [None]:
!ls /content/drive/MyDrive/BERT_embeddings | wc -l

4998


# 2nd method GloVe embeddings

## Why GloVe embeddings

[GloVe](https://nlp.stanford.edu/projects/glove/) is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. 

GloVe can create generalized representations for words by taking into consideration each word's statistical occurrence in the text.

These embeddings do not carry as much contextual information as BERT's, but they are cheap to compute, and searching for a similar vector is faster because a GloVe vector has less than half the dimensions of a BERT vector (300 versus 768).

### Implementation details

As for the implementation, the embedding layer outputs a 300-dimensional vector for every word we pass to it. Storing 300*n values for every sentence of n words is very inefficient, so we take the average and save only one 300-dimensional vector per sentence.
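The averaging step can be shown on a toy table. The words and 3-dimensional vectors below are invented for illustration; the real GloVe vectors have 300 dimensions, and `average_sentence_vector` is a hypothetical helper name.

```python
import numpy as np

# Toy 3-dimensional "GloVe" table; the words and values are invented.
toy_vectors = {
    "virus": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "spreads": np.array([3.0, 2.0, 0.0], dtype=np.float32),
    "<unk>": np.array([0.0, 0.0, 0.0], dtype=np.float32),
}

def average_sentence_vector(sentence, vectors):
    """Look up every whitespace-separated word and average the word vectors."""
    word_vectors = [vectors.get(word, vectors["<unk>"]) for word in sentence.split()]
    return np.mean(word_vectors, axis=0)

# One vector per sentence instead of one vector per word.
sentence_vector = average_sentence_vector("virus spreads", toy_vectors)
print(sentence_vector)
```

Whatever the sentence length, the stored result has the same size as a single word vector, which is what keeps the saved files small.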

## Download the pre-trained glove embeddings

In [None]:
import torch
import torch.nn as nn
import numpy as np
import time
import sys
import json
import pickle
import os
from torch.utils.data import DataLoader

### Functions that download the embeddings files

In [None]:
# download and unzip embedding vectors file

def download_embeddings(file):
    if not os.path.exists(file+".zip"):
        print("Download glove "+file)
        import urllib.request
        urllib.request.urlretrieve("http://nlp.stanford.edu/data/"+file+".zip", file+".zip")
    else:
        print("File already downloaded")

    if not os.path.exists(file+".txt"):
        print("Unzip glove file")
        import zipfile
        zip_ref = zipfile.ZipFile(file+".zip", 'r')
        zip_ref.extractall()
        zip_ref.close()
    else:
        print("File already unziped")

# get the vector arrays and ids of the vocabulary
def get_embedings_vocab_id_array(file, embeddings_dim):

    # check if embeddings dict is saved
    if not os.path.exists("embeddings_dict.pickle"):
        # get embeddings dict
        print("Getting embeddings dict")
        embeddings_dict = {}
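        with open(file+".txt", 'r', encoding="utf-8") as f:
            for line in f:
                values = line.rstrip().split(" ")
                # take the last embeddings_dim values as the vector, in case a token itself contains spaces
                word = " ".join(values[:-embeddings_dim])
                vector = np.asarray(values[-embeddings_dim:], np.float32)
                embeddings_dict[word] = vector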
        with open('embeddings_dict.pickle', 'wb') as f:
            pickle.dump(embeddings_dict, f)

    else:
        # if saved load it
        print("Getting embeddings dict from pickle")
        with open('embeddings_dict.pickle', 'rb') as f:
            embeddings_dict = pickle.load(f)


    # get vocabulary, id, embeddings array
    print("Getting vovabulary, id, embedings array")
    # set vector for unknown
    embeddings_dict["<unk>"] = np.random.normal(scale = 0.420, size=(embeddings_dim,  ))
    # set vector for pad
    embeddings_dict["<pad>"] = np.random.normal(scale = 0.69, size=(embeddings_dim,   ))
    # vocabulary in a set for faster membership tests
    vocabulary_words = set(embeddings_dict.keys())

    # get the vocabulary ids in dict order, so that id i points to row i of the embeddings array
    # (enumerating the set instead would not match the row order below)
    vocabulary_words_id = {w:i for i,w in enumerate(embeddings_dict.keys())}

    # get the embeddings array to feed in the embeddings layer
    embeddings_array = np.empty((len(vocabulary_words), embeddings_dim), dtype=np.float32)
    for index, embedding_array in enumerate(embeddings_dict.values()):
        embeddings_array[index] = embedding_array

    return vocabulary_words, vocabulary_words_id, embeddings_array



#### Download the pre-trained embeddings for the GloVe model

In [None]:
glove_file = "glove.840B.300d"
embeddings_dim = 300
download_embeddings(glove_file)
vocabulary, vocabulary_id, embeddings_array = get_embedings_vocab_id_array(glove_file, embeddings_dim)

Download glove glove.840B.300d
Unzip glove file
Getting embeddings dict
Getting vovabulary, id, embedings array


#### Checking for GPU usage

In [None]:
# gpu for pytorch
device = torch.device('cpu')
if torch.cuda.is_available():
    device = torch.device('cuda')
print(device)

cuda


### Get the pre-trained GLOVE model

In [None]:
def get_embeddings(model, dataloader):
    # create an empty array to store the outputs of the model
    # its size must be the size of the dataset behind the dataloader
    data_array = torch.empty((len(dataloader.dataset), model.output_dim))
    batch_size = dataloader.batch_size
    for batch_idx, data in enumerate(dataloader):
        # Load the data to gpu
        # t = time.time()
        data = data.to(device)
        model_output = model(data).cpu()

        # append to the data array
        data_array[batch_idx*batch_size : (batch_idx*batch_size) + len(model_output)] = model_output 
        # print(time.time()-t)
    # return the results from all the batches
    return  data_array

class GLOVE_Model(torch.nn.Module):
    def __init__(self, embeddings_array, tokenizer, trainable=False):
        super(GLOVE_Model, self).__init__()

        self.embedding = nn.Embedding.from_pretrained(torch.from_numpy(embeddings_array))
        self.embedding.weight.requires_grad = trainable
        self.tokenizer = tokenizer
        # output size is the embedding dimension (300 for glove.840B.300d)
        self.output_dim = embeddings_array.shape[1]

    def forward(self, x):
        """
        we average the output layer of the pretrained glove model
        """
        model_output = self.embedding(x).detach()
        model_output = torch.mean(model_output, dim=1)
        return model_output

    def encode(self, sentences):
        """
        encode the given sentences
        we do not use that during training in order to use the parallelization provided by the torch dataloader
        """
        tokenizes_sentences = self.tokenizer(sentences)
        tokenizes_sentences = tokenizes_sentences.to(device)
        return self.forward(tokenizes_sentences)

def GLOVE_tokenizer(sentences, vocabulary_id=vocabulary_id, vocabulary=vocabulary):
    # transform text to tokenized vocabulary ids
    tokenized_sentences = []
    for sentence in sentences:
        tok_sentence = [vocabulary_id[w] if w in vocabulary else vocabulary_id["<unk>"] for w in sentence.split()]
        tokenized_sentences.append(tok_sentence)

    # create a list of torch vectors (long dtype, so empty sentences also yield index tensors)
    tokenized_vectors = [torch.tensor(vector, dtype=torch.long) for vector in tokenized_sentences]
    
    # add padding
    padded_tokenized_vectors = torch.nn.utils.rnn.pad_sequence(tokenized_vectors, padding_value=vocabulary_id["<pad>"], batch_first=True)
    return padded_tokenized_vectors

GLOVE_model = GLOVE_Model(embeddings_array, GLOVE_tokenizer)
GLOVE_model.to(device)


GLOVE_Model(
  (embedding): Embedding(2196018, 300)
)

## Compute all the GloVe embeddings

### Title

In [None]:
GLOVE_directory = "/content/drive/MyDrive/GLOVE_embeddings"

part_of_doc = "title"

get_save_embeddings(GLOVE_model, GLOVE_tokenizer,  dataset[part_of_doc], keys, GLOVE_directory, part_of_doc)

### Abstract

In [None]:
part_of_doc = "abstract"

get_save_embeddings(GLOVE_model, GLOVE_tokenizer,  dataset[part_of_doc], keys, GLOVE_directory, part_of_doc)

### Main body

In [None]:
part_of_doc = "main_body"

get_save_embeddings(GLOVE_model, GLOVE_tokenizer,  dataset[part_of_doc], keys, GLOVE_directory, part_of_doc)

### Zip embeddings

In [None]:
%%capture
!zip -r "/content/drive/MyDrive/GLOVE_embeddings.zip" "/content/drive/MyDrive/GLOVE_embeddings"

In [None]:
!du -sh  /content/drive/MyDrive/GLOVE_embeddings

971M	/content/drive/MyDrive/GLOVE_embeddings


In [None]:
!ls /content/drive/MyDrive/GLOVE_embeddings | wc -l

4998


# 3rd method BERT large model pre-trained on a corpus of Twitter messages about COVID-19

## About the model

COVID-Twitter-BERT (CT-BERT) is a transformer-based model pretrained on a large corpus of Twitter messages on the topic of COVID-19. The v2 model is trained on 97M tweets (1.2B training examples).

According to the authors' evaluation, when used on domain-specific datasets this model obtains a marginal performance increase of 10–30% compared to the standard BERT-Large model. Most improvements are shown on COVID-19-related and Twitter-like messages.

The github page of this model can be found [here](https://github.com/digitalepidemiologylab/covid-twitter-bert)

A model trained on coronavirus-related text should represent our sentences better, because it has already seen the domain-specific vocabulary. In the text used to pre-train the original BERT, words like "coronavirus" are quite rare. By using a model that was already trained on this topic, we expect a better semantic representation of the sentences as vectors.

As for the implementation, we will use the huggingface library. This model has been uploaded to the huggingface repository and we can download it from there. After that, we follow the same steps as for the BERT embeddings.

## Install the [huggingface library](https://huggingface.co/transformers/index.html) and download the large model

We use the uncased model, which is generally the better choice unless case information matters for the task (e.g., Named Entity Recognition or Part-of-Speech tagging).



In [None]:
%%capture
!pip install transformers
import torch
import torch.nn as nn
import numpy as np
import time
import sys
import transformers
from transformers import BertTokenizer, BertModel, BertTokenizerFast
from torch.utils.data import DataLoader

#### Checking for gpu *usage*

In [None]:
# gpu for pytorch
device = torch.device('cpu')
if torch.cuda.is_available():
    device = torch.device('cuda')
print(device)

cuda


### Get the pre-trained BERT model  

In [None]:
def get_embeddings(model, dataloader):
    # create an empty array to store the outputs of the model
    # its size must be the size of the dataset behind the dataloader
    data_array = torch.empty((len(dataloader.dataset), model.output_dim))
    batch_size = dataloader.batch_size
    for batch_idx, data in enumerate(dataloader):
        # Load the data to gpu
        # t = time.time()
        data["input_ids"] = data["input_ids"].to(device)
        data["attention_mask"] = data["attention_mask"].to(device)
        data["token_type_ids"] = data["token_type_ids"].to(device)
        model_output = model(data).cpu()

        # append to the data array
        data_array[batch_idx*batch_size : (batch_idx*batch_size) + len(model_output)] = model_output 
        # print(time.time() - t)
    # return the results from all the batches
    return  data_array

class BERT_Model(torch.nn.Module):
    def __init__(self, pre_trained, tokenizer):

        super(BERT_Model, self).__init__()
        self.bert = pre_trained
        self.tokenizer = tokenizer
        # base bert output
        self.output_dim = self.bert.config.hidden_size

    def forward(self, x):
        """
        we average the last output layer of the pretrained bert model
        """
        model_output = self.bert(input_ids=x["input_ids"], attention_mask=x["attention_mask"] ).last_hidden_state.detach()
        model_output = torch.mean(model_output, dim=1)
        return model_output

    def encode(self, sentences):
        """
        encode the given sentences
        we do not use that during training in order to use the parallelization provided by the torch dataloader
        """
        tokenizes_sentences = self.tokenizer(sentences)
        tokenizes_sentences = tokenizes_sentences.to(device)
        return self.forward(tokenizes_sentences)


twitter_bert_tokenizer = BertTokenizer.from_pretrained('digitalepidemiologylab/covid-twitter-bert-v2', model_max_length=512)
twitter_bert_model = BertModel.from_pretrained('digitalepidemiologylab/covid-twitter-bert-v2')

def COVID_TWITER_BERT_tokenizer(batch):
    return twitter_bert_tokenizer(batch, return_tensors="pt", padding=True, truncation=True)

COVID_TWITER_BERT_model = BERT_Model(twitter_bert_model, COVID_TWITER_BERT_tokenizer)
COVID_TWITER_BERT_model.to(device)
pass

## Compute all the BERT embeddings

### Title

In [None]:
COVID_TWITER_BERT_directory = "/content/drive/MyDrive/COVID_TWITER_BERT_embeddings"

part_of_doc = "title"

get_save_embeddings(COVID_TWITER_BERT_model, COVID_TWITER_BERT_tokenizer,  dataset[part_of_doc], keys, COVID_TWITER_BERT_directory, part_of_doc)

### Abstract

In [None]:
part_of_doc = "abstract"

get_save_embeddings(COVID_TWITER_BERT_model, COVID_TWITER_BERT_tokenizer,  dataset[part_of_doc], keys, COVID_TWITER_BERT_directory, part_of_doc)

### Main body

In [None]:
part_of_doc = "main_body"

get_save_embeddings(COVID_TWITER_BERT_model, COVID_TWITER_BERT_tokenizer,  dataset[part_of_doc], keys, COVID_TWITER_BERT_directory, part_of_doc)

### Zip embeddings

In [None]:
%%capture
!zip -r "/content/drive/MyDrive/COVID_TWITER_BERT_embeddings.zip" "/content/drive/MyDrive/COVID_TWITER_BERT_embeddings"

In [None]:
!du -sh  /content/drive/MyDrive/COVID_TWITER_BERT_embeddings

3.2G	/content/drive/MyDrive/COVID_TWITER_BERT_embeddings


In [None]:
!ls /content/drive/MyDrive/COVID_TWITER_BERT_embeddings | wc -l

4998


# 4th method pre-trained Sentence-BERT embeddings

For our last method, we will feed our sentences to a Sentence-BERT model [(Nils Reimers and Iryna Gurevych, 2019)](https://arxiv.org/pdf/1908.10084.pdf) and use the output as sentence embeddings.

## Why Sentence-BERT

Sentence-BERT (SBERT), as described in the paper of [(Nils Reimers and Iryna Gurevych, 2019)](https://arxiv.org/pdf/1908.10084.pdf), is a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.

BERT can produce semantically meaningful embeddings, but it was never trained for that exact purpose. It was trained as a masked language model and to predict whether one sentence follows another.

SBERT builds on that ability of BERT and is fine-tuned on labeled datasets to improve the quality of the sentence embeddings it produces.

To fine-tune BERT, the authors of the paper created siamese and triplet networks and updated the weights so that the produced sentence embeddings are semantically meaningful and can be compared with cosine similarity.

In the paper, this method outperforms both averaged BERT embeddings and averaged GloVe embeddings on the STS benchmarks.



### About the implementation


We will use a base pre-trained S-BERT model with an output size of 768. As with BERT, the sentence embedding is the average of the outputs for all tokens, which gives a 768-dimensional embedding for each sentence.

The sentence transformers library does this job for us, so we use it to get the embeddings directly from the sentences.

## Install the [sentence transformers](https://github.com/UKPLab/sentence-transformers) library and download the base distil model

In [None]:
%%capture
!pip install sentence-transformers --upgrade
import torch
import torch.nn as nn
import numpy as np
import time
import sys
from sentence_transformers import SentenceTransformer
from torch.utils.data import DataLoader

#### Checking for gpu *usage*

In [None]:
# gpu for pytorch
device = torch.device('cpu')
if torch.cuda.is_available():
    device = torch.device('cuda')
print(device)

cuda


### Get the pre-trained S-BERT model  

In [None]:
def get_embeddings(model, dataloader):
    # create an empty array to store the outputs of the model
    # its size must be the size of the dataset behind the dataloader
    data_array = torch.empty((len(dataloader.dataset), 768))
    batch_size = dataloader.batch_size
    for batch_idx, data in enumerate(dataloader):
        # Load the data to gpu
        # t = time.time()

        model_output = model.encode(data)

        # append to the data array
        data_array[batch_idx*batch_size : (batch_idx*batch_size) + len(model_output)] = torch.from_numpy(model_output)
        # print(time.time() - t)
    # return the results from all the batches
    return  data_array


SBERT_model = SentenceTransformer('stsb-distilbert-base')

# no need for special tokenizer
def SBERT_tokenizer(batch):
    return batch

SBERT_model.to(device)


## Compute all the SBERT embeddings

### Title

In [None]:
SBERT_directory = "/content/drive/MyDrive/SBERT_embeddings"

part_of_doc = "title"

get_save_embeddings(SBERT_model, SBERT_tokenizer,  dataset[part_of_doc], keys, SBERT_directory, part_of_doc)

### Abstract

In [None]:
part_of_doc = "abstract"

get_save_embeddings(SBERT_model, SBERT_tokenizer,  dataset[part_of_doc], keys, SBERT_directory, part_of_doc)

### Main body

In [None]:
part_of_doc = "main_body"

get_save_embeddings(SBERT_model, SBERT_tokenizer,  dataset[part_of_doc], keys, SBERT_directory, part_of_doc)

### Zip embeddings

In [None]:
%%capture
!zip -r "/content/drive/MyDrive/SBERT_embeddings.zip" "/content/drive/MyDrive/SBERT_embeddings"

In [None]:
!du -sh  /content/drive/MyDrive/SBERT_embeddings

2.4G	/content/drive/MyDrive/SBERT_embeddings


In [None]:
!ls /content/drive/MyDrive/SBERT_embeddings | wc -l

4998


# About the measure of similarity

To measure similarity, we compute the cosine similarity between the query and every sentence of every scientific document.

We have already saved the representation vectors of all the sentences of each article. We scan each vector and measure its similarity with the query. Finally, we find the sentence with the greatest similarity and return the document that contains it.
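The search described above can be sketched on toy data. The 2-dimensional embeddings below are invented for illustration, and `best_matching_document` is a hypothetical helper name rather than part of this notebook.

```python
import torch
import torch.nn.functional as F

def best_matching_document(query_embedding, document_embeddings):
    """Return (document index, sentence index, score) of the most similar sentence.

    document_embeddings holds one (num_sentences, dim) tensor per document.
    """
    best = (-1, -1, float("-inf"))
    for doc_idx, sentences in enumerate(document_embeddings):
        if sentences.numel() == 0:
            continue  # skip documents without any sentence embeddings
        # cosine similarity between the query and every sentence of this document
        scores = F.cosine_similarity(sentences, query_embedding.unsqueeze(0), dim=1)
        score, sent_idx = scores.max(dim=0)
        if score.item() > best[2]:
            best = (doc_idx, sent_idx.item(), score.item())
    return best

# Toy 2-dimensional embeddings, invented for illustration.
documents = [
    torch.tensor([[1.0, 0.0], [0.0, 1.0]]),   # document 0
    torch.tensor([[-1.0, 0.0], [1.0, 1.0]]),  # document 1
]
query = torch.tensor([2.0, 2.0])
doc_idx, sent_idx, score = best_matching_document(query, documents)
```

Cosine similarity ignores vector length, so the query `[2, 2]` matches the sentence `[1, 1]` perfectly even though their magnitudes differ.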

## Methods that get the saved embeddings and load them in a new dataset

With the methods below, we load the embeddings from a given directory so that we can compare them with the given queries.

In [None]:
def extract_saved_doc_embeddings(directory, key, part_of_document=("title", "abstract", "main_body")):
    # find the saved torch files of this document
    path = os.path.join(directory, key)
    available = set(os.listdir(path))

    # always collect the files in the order title, abstract, main body,
    # so that the sentence indexes match the order used when printing
    files = []
    for part in ("title", "abstract", "main_body"):
        file = part + ".pt"
        if part in part_of_document and file in available:
            files.append(os.path.join(path, file))

    # open the torch files
    embedding_arrays = [torch.load(file) for file in files]

    # concat all the tensors into one array (one row per sentence)
    document_embedding_arrays = torch.cat(embedding_arrays, 0)
    return document_embedding_arrays



class Saved_Embeddings_Dataset(torch.utils.data.Dataset):
    def __init__(self, directory, keys, part_of_document=["title", "abstract", "main_body"]):
        self.directory = directory
        self.keys = keys
        self.part_of_document = part_of_document 
        

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):        
        return extract_saved_doc_embeddings(self.directory, self.keys[index], self.part_of_document)
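
Because the tensors are concatenated in the fixed order title, abstract, main body, a row index in the returned tensor identifies a sentence in that same order. The helper below is a hypothetical illustration (`locate_sentence` is not part of the notebook) of how such a flat index could be mapped back to its part, assuming the number of sentences in each part is known.

```python
def locate_sentence(index, part_lengths):
    """Map a flat row index to (part name, index inside that part).

    `part_lengths` is a list of (name, number of sentences) pairs in the
    same order in which the embeddings were concatenated.
    """
    if index < 0:
        raise IndexError("index must be non-negative")
    offset = 0
    for name, length in part_lengths:
        if index < offset + length:
            return name, index - offset
        offset += length
    raise IndexError("index %d is outside the document" % index)

# assumed sizes for one document: 1 title, 4 abstract and 10 body sentences
parts = [("title", 1), ("abstract", 4), ("main_body", 10)]
print(locate_sentence(0, parts))   # ('title', 0)
print(locate_sentence(3, parts))   # ('abstract', 2)
print(locate_sentence(5, parts))   # ('main_body', 0)
```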

## Similarity Comparator

Cosine similarity is a simple measure of how similar two vectors are. We could use other ways to find the similarity, such as running the BERT model itself, but that would be very time-consuming and costly.
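
For reference, the standard definition of the cosine similarity between a question embedding $q$ and a sentence embedding $s$ is given below. The symbols $q$, $s$ and $d$ (the embedding dimension) are chosen here for illustration.

```latex
\operatorname{sim}(q, s)
  = \frac{q \cdot s}{\lVert q \rVert_2 \, \lVert s \rVert_2}
  = \frac{\sum_{i=1}^{d} q_i s_i}
         {\sqrt{\sum_{i=1}^{d} q_i^2}\,\sqrt{\sum_{i=1}^{d} s_i^2}}
```

The value lies between -1 and 1 and depends only on the angle between the two vectors, not on their lengths. The `eps=1e-6` argument in the code below only guards against division by zero.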

In [None]:
class Cosine_Similarity_Comparator(torch.nn.Module):
    def __init__(self):

        super(Cosine_Similarity_Comparator, self).__init__()
        self.CosineSimilarity = torch.nn.CosineSimilarity(dim=-1, eps=1e-6)
        # number of documents (batches) scanned so far
        self.indexes_passed = 0
        # (document index, sentence index) of the best match so far
        self.max_index_found_at = [0, 0]
        # best similarity so far; cosine similarity is never below -1
        self.max_similarity = torch.tensor(-1.0)

    def forward(self, question, data):

        # similarity of the question with every sentence of the document
        similarity = self.CosineSimilarity(question.reshape(1,-1), data)

        temp_max = similarity.max()
        temp_argmax = torch.argmax(similarity)
        # keep the best sentence seen so far over all documents
        if temp_max > self.max_similarity:
            self.max_index_found_at = self.indexes_passed, temp_argmax
            self.max_similarity = temp_max
        self.indexes_passed += 1

# scan the whole embeddings dataset and find the sentence with the highest score
def get_most_similar_document_index(question, model, dataset, comparator=None):

    # create a fresh comparator per call: a default instance in the signature
    # would be shared between calls and remember the maximum of earlier questions
    if comparator is None:
        comparator = Cosine_Similarity_Comparator()

    # get a loader of size 1 to scan one document each time
    loader = DataLoader(dataset, batch_size=1, shuffle=False, pin_memory=True, num_workers=4, drop_last=False)

    # start a timer
    start_time = time.time()

    # encode the question into an embedding
    encoded_question = model.encode([question])
    if not isinstance(encoded_question, torch.Tensor):
        encoded_question = torch.from_numpy(encoded_question)
    encoded_question = encoded_question.to(device)

    for batch_idx, data in enumerate(loader):
        # pass data to device if gpu is available
        data = data.to(device)
        # the comparator keeps the results
        comparator(encoded_question, data)

    end_time = time.time()
    query_time = end_time - start_time

    # return the indexes of the best match, its similarity and the query time
    return comparator.max_index_found_at, (comparator.max_similarity, query_time)
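
Note that the comparator is stateful: it keeps the best score seen so far across successive calls, so one instance should serve exactly one question. The toy class below (`RunningBest` is a hypothetical stand-in, not the notebook's class) mimics that bookkeeping with plain lists and shows what happens when an instance is reused for a second question.

```python
class RunningBest:
    """Tracks the best score over successive batches, like the comparator."""

    def __init__(self):
        self.batches_seen = 0
        self.best_index = (0, 0)      # (batch index, position inside the batch)
        self.best_score = -1.0

    def update(self, scores):
        top = max(scores)
        if top > self.best_score:
            self.best_index = (self.batches_seen, scores.index(top))
            self.best_score = top
        self.batches_seen += 1

# question A: two documents, the best score is 0.9 in document 1
tracker = RunningBest()
for scores in ([0.2, 0.4], [0.9, 0.1]):
    tracker.update(scores)
print(tracker.best_index, tracker.best_score)    # (1, 0) 0.9

# reusing the same tracker for question B keeps the stale 0.9 ...
for scores in ([0.5, 0.3], [0.6, 0.7]):
    tracker.update(scores)
print(tracker.best_index, tracker.best_score)    # still (1, 0) 0.9

# ... whereas a fresh tracker reports the true best for question B
fresh = RunningBest()
for scores in ([0.5, 0.3], [0.6, 0.7]):
    fresh.update(scores)
print(fresh.best_index, fresh.best_score)        # (1, 1) 0.7
```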

# Let's ask some questions with all the embedding methods

## Methods that print the title and the relevant passages of the document

In [None]:

def print_similar_title_and_document(indexes, dataset, similarity, window=2):

    document_index = indexes[0]
    print("Document found with similarity [%f] in time [%f] seconds"% similarity)
    title = dataset["title"][document_index][:][0]
    print("Title\n{"+title+"}")

    # find the most similar sentence in the document from its index:
    # rebuild the sentence list in the same order in which the embeddings
    # were concatenated (title, abstract, main body)
    sentence_document = [title]
    sentence_document.extend(dataset["abstract"][document_index][:])
    sentence_document.extend(dataset["main_body"][document_index][:])

    # the comparator returns the index as a tensor; convert it to a plain int
    index = int(indexes[1])
    # print a small window before and after the most similar sentence
    print("Main body")
    for w in reversed(range(window)):
        if index-w-1 > 0:
            print(sentence_document[index-w-1])
    print(">>>{"+sentence_document[index]+"}<<<")
    for w in range(window):
        if index+w+1 < len(sentence_document):
            print(sentence_document[index+w+1])

def get_similar_title_and_document(question, model, dataset, embeddings_dataset):
    comparator = Cosine_Similarity_Comparator()
    comparator.to(device)
    model.to(device)
    indexes, similarity = get_most_similar_document_index(question, model, embeddings_dataset, comparator)
    print_similar_title_and_document(indexes, dataset, similarity)

question = "I don't like coronovirus"


## Load the embeddings datasets

Transfer the zipped embeddings to local storage and unzip them.

In [None]:
%%capture
!cp -R /content/drive/MyDrive/BERT_embeddings.zip ./
!cp -R /content/drive/MyDrive/GLOVE_embeddings.zip ./
!cp -R /content/drive/MyDrive/COVID_TWITER_BERT_embeddings.zip ./
!cp -R /content/drive/MyDrive/SBERT_embeddings.zip ./


!unzip "BERT_embeddings.zip" 
!unzip "GLOVE_embeddings.zip" 
!unzip "COVID_TWITER_BERT_embeddings.zip" 
!unzip "SBERT_embeddings.zip" 

In [None]:
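# zip stored the paths without the leading "/", so unzipping recreates them under the current directory
BERT_directory = "content/drive/MyDrive/BERT_embeddings"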
GLOVE_directory = "content/drive/MyDrive/GLOVE_embeddings"
COVID_TWITER_BERT_directory = "content/drive/MyDrive/COVID_TWITER_BERT_embeddings"
SBERT_directory = "content/drive/MyDrive/SBERT_embeddings"


BERT_embeddings_dataset = Saved_Embeddings_Dataset(BERT_directory, keys)
GLOVE_embeddings_dataset = Saved_Embeddings_Dataset(GLOVE_directory, keys)
COVID_TWITER_BERT_embeddings_dataset = Saved_Embeddings_Dataset(COVID_TWITER_BERT_directory, keys)
SBERT_embeddings_dataset = Saved_Embeddings_Dataset(SBERT_directory, keys)

## Questions

### What are the coronoviruses

In [None]:
question = "What are the coronoviruses"

print("********************BERT METHOD********************")
get_similar_title_and_document(question, BERT_model, dataset, BERT_embeddings_dataset)
print("\n")

print("********************GloVe METHOD********************")
get_similar_title_and_document(question, GLOVE_model, dataset, GLOVE_embeddings_dataset)
print("\n")

print("********************COVID TWITER BERT METHOD********************")
get_similar_title_and_document(question, COVID_TWITER_BERT_model, dataset, COVID_TWITER_BERT_embeddings_dataset)
print("\n")

print("********************SBERT METHOD********************")
get_similar_title_and_document(question, SBERT_model, dataset, SBERT_embeddings_dataset)

********************BERT METHOD********************
Document found with similarity [0.787640] in time [70.818873] seconds
Title
{ Teschovirus}
Main body
>>>{ Teschoviruses are emerging pathogens, belonging to the family Picornaviridae, and infects porcine population only.}<<<
Among all, porcine teschoviruses (PTVs) are of high prominence leading to clinical illness and consequent economic loss to the livestock sector.
These are associated with extremely lethal non-suppurative polioencephalomyelitis (Teschen disease) and are distributed world over.


********************GloVe METHOD********************
Document found with similarity [0.657842] in time [28.851360] seconds
Title
{ The biomechanical role of extra-axonemal structures in shaping the flagellar beat of Euglena}
Main body
Since H 1 = 0 we must have (c 1 , c 2 ) = (0, 0).
However, in this case, (24) admits the unique solution U = 0, which is incompatible with (25) .
>>>{If H 1 = 0 then the boundary conditions impose c 1 = 0, but

### What was discovered in Wuhuan in December 2019

In [None]:
question = "What was discovered in Wuhuan in December 2019"

print("********************BERT METHOD********************")
get_similar_title_and_document(question, BERT_model, dataset, BERT_embeddings_dataset)
print("\n")

print("********************GloVe METHOD********************")
get_similar_title_and_document(question, GLOVE_model, dataset, GLOVE_embeddings_dataset)
print("\n")

print("********************COVID TWITER BERT METHOD********************")
get_similar_title_and_document(question, COVID_TWITER_BERT_model, dataset, COVID_TWITER_BERT_embeddings_dataset)
print("\n")

print("********************SBERT METHOD********************")
get_similar_title_and_document(question, SBERT_model, dataset, SBERT_embeddings_dataset)

********************BERT METHOD********************
Document found with similarity [0.791220] in time [70.831984] seconds
Title
{ Social distance and SARS memory: impact on the public awareness of 2019 novel coronavirus (COVID-19) outbreak}
Main body
Xilingol League in Inner Mongolia ranked 4 th , with a retention rate at 103%.
Xilingol is far away from Wuhan in terms of social distance, but it was struck by SARS.
>>>{It is worth noting that a confirmed case of plague was reported in Xilingol on Nov 16 th , 2019, only 45 days before the Wuhan outbreak.}<<<
The effects of social distance and SARS memory on the lead-time advantage are estimated according to Eq.
4, controlled by Euclidean distances, GDP per capita and the city's administrative level (Table 1) .


********************GloVe METHOD********************
Document found with similarity [0.638674] in time [28.845056] seconds
Title
{ Human antibodies neutralizing diphtheria toxin in vitro and in vivo}
Main body
Unbound antibodies 

### What is Coronovirus Disease 2019

In [None]:
question = "What is Coronovirus Disease 2019"

print("********************BERT METHOD********************")
get_similar_title_and_document(question, BERT_model, dataset, BERT_embeddings_dataset)
print("\n")

print("********************GloVe METHOD********************")
get_similar_title_and_document(question, GLOVE_model, dataset, GLOVE_embeddings_dataset)
print("\n")

print("********************COVID TWITER BERT METHOD********************")
get_similar_title_and_document(question, COVID_TWITER_BERT_model, dataset, COVID_TWITER_BERT_embeddings_dataset)
print("\n")

print("********************SBERT METHOD********************")
get_similar_title_and_document(question, SBERT_model, dataset, SBERT_embeddings_dataset)

********************BERT METHOD********************
Document found with similarity [0.801401] in time [70.817739] seconds
Title
{ Performing Structural Heart Disease Interventions During the Coronavirus Disease 2019 (COVID-19) Pandemic – But What Are the Downsides?}
Main body
>>>{ Performing Structural Heart Disease Interventions During the Coronavirus Disease 2019 (COVID-19) Pandemic – But What Are the Downsides?}<<<
 NaN
Angiography and Interventions consensus statement on triage considerations for patients referred for structural heart disease (SHD) intervention during the current coronavirus disease 2019 (COVID-19) pandemic by Shah et al (1) .


********************GloVe METHOD********************
Document found with similarity [0.677447] in time [28.797335] seconds
Title
{ Method for Active Pandemic Curve Management (MAPCM)}
Main body
These numbers are guesstimates, but can be replaced by reliable data.
There are conflicting reports about fatality rates.
>>>{While low tests number

### What is COVID-19

In [None]:
question = "What is COVID-19"

print("********************BERT METHOD********************")
get_similar_title_and_document(question, BERT_model, dataset, BERT_embeddings_dataset)
print("\n")

print("********************GloVe METHOD********************")
get_similar_title_and_document(question, GLOVE_model, dataset, GLOVE_embeddings_dataset)
print("\n")

print("********************COVID TWITER BERT METHOD********************")
get_similar_title_and_document(question, COVID_TWITER_BERT_model, dataset, COVID_TWITER_BERT_embeddings_dataset)
print("\n")

print("********************SBERT METHOD********************")
get_similar_title_and_document(question, SBERT_model, dataset, SBERT_embeddings_dataset)

********************BERT METHOD********************
Document found with similarity [0.830035] in time [70.819024] seconds
Title
{ All about COVID-19 in brief}
Main body
>>>{ All about COVID-19 in brief}<<<
 NaN
A new coronavirus was discovered due to detection of an unfamiliar pneumonia in a group of patients in December 2019 in Wuhan, China, initially named as 2019 novel coronavirus (2019-nCoV) by the World Health Organization (WHO) on 7 January.


********************GloVe METHOD********************
Document found with similarity [0.621421] in time [28.818199] seconds
Title
{ Evaluation of "stratify and shield" as a policy option for ending the COVID-19 lockdown in the UK}
Main body
The 196 sensitivity (1 − p 1 ) of the classifier, with threshold set so that 15% of the population 197 will be classified as high risk, is the maximal proportion of deaths that can be prevented 198 by a stratify-and shield-policy optimally applied, in comparison with an unselective 199 lifting of social d

### What is caused by SARS-COV2

In [None]:
question = "What is caused by SARS-COV2"

print("********************BERT METHOD********************")
get_similar_title_and_document(question, BERT_model, dataset, BERT_embeddings_dataset)
print("\n")

print("********************GloVe METHOD********************")
get_similar_title_and_document(question, GLOVE_model, dataset, GLOVE_embeddings_dataset)
print("\n")

print("********************COVID TWITER BERT METHOD********************")
get_similar_title_and_document(question, COVID_TWITER_BERT_model, dataset, COVID_TWITER_BERT_embeddings_dataset)
print("\n")

print("********************SBERT METHOD********************")
get_similar_title_and_document(question, SBERT_model, dataset, SBERT_embeddings_dataset)

********************BERT METHOD********************
Document found with similarity [0.877689] in time [70.867692] seconds
Title
{ Patient-derived mutations impact pathogenicity of SARS-CoV-2}
Main body
>>>{ Patient-derived mutations impact pathogenicity of SARS-CoV-2}<<<
 The sudden outbreak of the severe acute respiratory syndrome-coronavirus (SARS-CoV-2) has spread globally with more than 1,300,000 patients diagnosed and a death toll of 70,000.
Current genomic survey data suggest that single nucleotide variants (SNVs) are abundant.


********************GloVe METHOD********************
Document found with similarity [0.666858] in time [28.800757] seconds
Title
{ MATHEMATICAL MODELING FOR TRANSMISSIBILITY OF COVID-19 VIA MOTORCYCLES}
Main body
https://doi.org/10.1101/2020.04.
18.20070797 doi: medRxiv preprint Case II.2 .
>>>{If there exists a natural number k : a k = 0, then the solution of Equation 3.11 is zero for t > t k , i.e., all solutions in spite of the initial value x 0 coinc

### How is COVID-19 spread

In [None]:
question = "How is COVID-19 spread"

print("********************BERT METHOD********************")
get_similar_title_and_document(question, BERT_model, dataset, BERT_embeddings_dataset)
print("\n")

print("********************GloVe METHOD********************")
get_similar_title_and_document(question, GLOVE_model, dataset, GLOVE_embeddings_dataset)
print("\n")

print("********************COVID TWITER BERT METHOD********************")
get_similar_title_and_document(question, COVID_TWITER_BERT_model, dataset, COVID_TWITER_BERT_embeddings_dataset)
print("\n")

print("********************SBERT METHOD********************")
get_similar_title_and_document(question, SBERT_model, dataset, SBERT_embeddings_dataset)

********************BERT METHOD********************
Document found with similarity [0.793027] in time [70.825008] seconds
Title
{ Spread of COVID-19 in India: A Simple Algebraic Study}
Main body
>>>{ Spread of COVID-19 in India: A Simple Algebraic Study}<<<
 The number of patients, infected with COVID-19, began to increase very rapidly in India from March 2020.
The country was put under lockdown from 25 March 2020.


********************GloVe METHOD********************
Document found with similarity [0.715870] in time [28.922324] seconds
Title
{ COVID-19: Spatial Analysis of Hospital Case-Fatality Rate in France}
Main body
Lethality depends on the intrinsic virulence of the virus but, unlike morbidity, it does 56 not depend on its contagiousness.
Virulence comes from the reproductive capacity of 57 the virus in the cell, its capacity for cellular degradation, and its ability to induce or not 58 an innate or specific immune response.
>>>{Virulence is of purely biological origin and once

### Where was COVID-19 discovered

In [None]:
question = "Where was COVID-19 discovered"

print("********************BERT METHOD********************")
get_similar_title_and_document(question, BERT_model, dataset, BERT_embeddings_dataset)
print("\n")

print("********************GloVe METHOD********************")
get_similar_title_and_document(question, GLOVE_model, dataset, GLOVE_embeddings_dataset)
print("\n")

print("********************COVID TWITER BERT METHOD********************")
get_similar_title_and_document(question, COVID_TWITER_BERT_model, dataset, COVID_TWITER_BERT_embeddings_dataset)
print("\n")

print("********************SBERT METHOD********************")
get_similar_title_and_document(question, SBERT_model, dataset, SBERT_embeddings_dataset)

********************BERT METHOD********************
Document found with similarity [0.782045] in time [70.834580] seconds
Title
{ Exploring the spread dynamics of COVID-19 inMorocco}
Main body
>>>{ Exploring the spread dynamics of COVID-19 inMorocco}<<<
 Despite some similarities of the dynamic of COVID-19 spread in Morocco and other countries, the infection, recovery and death rates remain very variable.
In this paper, we analyze the spread dynamics of COVID-19 in Morocco within a standard susceptible-exposed-infected-recovered-death (SEIRD) model.


********************GloVe METHOD********************
Document found with similarity [0.600120] in time [28.832526] seconds
Title
{ Arguing from Ignorance}
Main body
This is the hallmark of all heuristics.
They are 'fast and frugal' procedures that do not expend the resources of their more systematic counterparts in reasoning (Gigerenzer and Goldstein 1996) .
>>>{The scientist or health worker who must respond to an emerging infectious dis

### How does coronavirus spread

In [None]:
question = "How does coronavirus spread"


print("********************BERT METHOD********************")
get_similar_title_and_document(question, BERT_model, dataset, BERT_embeddings_dataset)
print("\n")

print("********************GloVe METHOD********************")
get_similar_title_and_document(question, GLOVE_model, dataset, GLOVE_embeddings_dataset)
print("\n")

print("********************COVID TWITER BERT METHOD********************")
get_similar_title_and_document(question, COVID_TWITER_BERT_model, dataset, COVID_TWITER_BERT_embeddings_dataset)
print("\n")

print("********************SBERT METHOD********************")
get_similar_title_and_document(question, SBERT_model, dataset, SBERT_embeddings_dataset)

********************BERT METHOD********************
Document found with similarity [0.763760] in time [70.987698] seconds
Title
{ Coronaviruses pandemics: Can neutralizing antibodies help?}
Main body
>>>{ Coronaviruses pandemics: Can neutralizing antibodies help?}<<<
 Abstract For the first time in Homo sapiens history, possibly, most of human activities is stopped by coronavirus disease 2019 (COVID-19).
Nearly eight billion people of this world are facing a great challenge, maybe not “to be or not to be” yet, but unpredictable.


********************GloVe METHOD********************
Document found with similarity [0.703812] in time [28.815904] seconds
Title
{ In silico approach toward the identification of unique peptides from viral protein infection: Application to COVID-19}
Main body
As shown in Figure 2A more peptides were identified from the nucleocapsid protein than any other protein, followed by the Spike protein.
When the first SARS-CoV-2 study was released we reprocessed this s

# Evaluate the results

We evaluate the models on two criteria: the time needed to compute the embeddings and to find the best sentence, and the relevance of the answers they returned.

## Computation time

Regarding computation time, the larger the model, the longer it takes to compute the embeddings. BERT and SBERT took about the same time, as they are of approximately the same size and both produce 768-dimensional vectors. GloVe needed the least time, as it is a much simpler model and its representations have only 300 dimensions. Finally, the BERT model pre-trained on COVID-19-related tweets was the hardest to run: computing its representations took more than 6 hours.

As for the time needed to find the best-matching article, the search time is roughly proportional to the size of the embeddings. BERT and SBERT, both with 768 dimensions, took the same time and were in the middle; GloVe was the fastest with only 300 dimensions; and the Twitter pre-trained model was the slowest with 1024 dimensions.
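
As a rough sanity check of this proportionality, the ratio of the embedding sizes can be compared with the ratio of the query times printed in the outputs above for the first question. This is only a back-of-the-envelope calculation on two recorded timings, not a benchmark.

```python
# embedding sizes stated in the text
bert_dim, glove_dim = 768, 300
# query times in seconds, copied from the output of the first question
bert_time, glove_time = 70.818873, 28.851360

dim_ratio = bert_dim / glove_dim
time_ratio = bert_time / glove_time

print("dimension ratio: %.2f" % dim_ratio)
print("time ratio:      %.2f" % time_ratio)
```

The two ratios come out close to each other (about 2.56 and 2.45), which is consistent with a search cost that grows with the vector length.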

## Relevance

The relevance of the results to the questions was not quite satisfactory.

More specifically, GloVe had the worst results. It may have been the fastest, but the statistical relationships between words are not enough to describe the meaning of a sentence.

Next comes plain BERT. The output of its final layer proved good enough to describe the content of the sentences semantically. In some cases it even gave better suggestions than the other BERT models that were adapted more specifically to this task.

Finally, the last two models, S-BERT and the Twitter pre-trained BERT, give the best results, and the differences between them are small. The Twitter pre-trained BERT, having been trained on relevant texts, includes the unusual coronavirus-related words in its vocabulary, and it often seems to work like a bag of words. S-BERT is slightly better, as the sentences it returns are related to each other and to the question. S-BERT was trained for exactly this purpose, so it is reasonable that it gives the best results.

## Conclusion


One reason for these unsatisfactory results could be that the models have not been trained on scientific medical texts that relate exclusively to COVID-19. Another, perhaps more important, reason could be the way we choose the most relevant document. Cosine similarity is a very simple, greedy criterion that likely does not capture all the semantic information in the vectors. Some candidate sentences may have been more relevant to the question and simply did not have the highest cosine similarity. This selection method leaves the heavy work to the models, and the models proved unable to carry it.

One way to improve search performance could be to train these models further on COVID-19 data. Another would be to compare the embeddings with a metric other than cosine similarity.
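
To illustrate that the choice of metric can change which sentence is selected, the sketch below ranks two made-up vectors by cosine similarity and by Euclidean distance. Euclidean distance is only one possible alternative, chosen here for illustration; the notebook does not evaluate it, and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: higher means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    """Euclidean distance: lower means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# toy vectors (assumed values): the first has the same direction as the
# question but is much longer, the second is close but slightly rotated
question = [1.0, 1.0]
sentences = [[10.0, 10.0], [1.0, 0.5]]

by_cosine = max(range(len(sentences)), key=lambda i: cosine(question, sentences[i]))
by_distance = min(range(len(sentences)), key=lambda i: euclidean(question, sentences[i]))
print(by_cosine, by_distance)   # 0 1
```

Cosine similarity ignores vector length and picks the first sentence, while Euclidean distance picks the second, so the two criteria can disagree on the same embeddings.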

# Resources

CORD-19
https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1.pdf

S-BERT
https://arxiv.org/pdf/1908.10084.pdf

GloVe
https://nlp.stanford.edu/projects/glove/

Hugging Face library
https://huggingface.co/bert-base-uncased

BERT model
https://arxiv.org/abs/1810.04805

BERT pre-trained on Twitter for COVID-related tweets
https://github.com/digitalepidemiologylab/covid-twitter-bert
