<p align="center">
 <img src="http://www.di.uoa.gr/themes/corporate_lite/logo_el.png" title="Department of Informatics and Telecommunications - University of Athens"/> </p>

---
<h1 align="center"> 
  Artificial Intelligence
</h1>
<h1 align="center" > 
  Deep Learning for Natural Language Processing
</h1>

---
<h2 align="center"> 
 <b>Konstantinos Nikoletos</b>
</h2>

<h3 align="center"> 
 <b>Winter 2020-2021</b>
</h3>


---
---


### __Task__ 
This exercise is about developing a document retrieval system that returns the titles of scientific
papers containing the answer to a given user question. You will use the first version of
the COVID-19 Open Research Dataset (CORD-19) in your work (articles in the folder
`comm_use_subset`).


For example, for the question “What are the coronaviruses?”, your system can return the
paper title “Distinct Roles for Sialoside and Protein Receptors in Coronavirus Infection”
since this paper contains the answer to the asked question.


To achieve the goal of this exercise, you will first need to read the paper Sentence-BERT:
Sentence Embeddings using Siamese BERT-Networks, in order to understand how you
can create sentence embeddings. In the related work of this paper, you will also find other
approaches for developing your model. For example, you can use GloVe embeddings,
etc. In this link, you can find the extended versions of this dataset to test your model, if
you want. You are required to:


<ol type="a">
  <li>Preprocess the provided dataset. You will decide which data of each paper is useful
to your model in order to create the appropriate embeddings. You need to explain
your decisions.</li>
  <li>Implement at least 2 different sentence embedding approaches (see the related work
of the Sentence-BERT paper), in order for your model to retrieve the titles of the
papers related to a given question.</li>
  <li>Compare your 2 models based on at least 2 different criteria of your choice. Explain
why you selected these criteria, your implementation choices, and the results. Some
questions you can pose are included here. You will need to provide the extra questions
you posed to your model and the results of all the questions as well.</li>
</ol>

### __Notebook__ 


In this notebook I apply Sentence-BERT (SBERT) to the CORD-19 dataset.


---
---

***Implemented in Colab (dark mode)***

__Import__ of essential libraries


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import sys # only needed to determine Python version number
import matplotlib # only needed to determine Matplotlib version 
import nltk
from nltk.stem import WordNetLemmatizer
import pprint
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext import data
import logging
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Selecting device (GPU - CUDA if available)

In [None]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


# Loading data
---

In [None]:
# Opening data file
import io
from google.colab import drive
from os import listdir
from os.path import isfile, join
import json

drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


Load the sentence dictionary, if it has already been created:

In [None]:
#@title Select number of papers that will be fed into the model { vertical-output: true, display-mode: "both" }
number_of_papers = "9000" #@param ["1000", "3000", "6000", "9000"]
import pickle

CORD19_Dataframe = r"/content/drive/My Drive/AI_4/CORD19_SentenceMap_"+number_of_papers+".pkl"
with open(CORD19_Dataframe, 'rb') as drivef:
  CORD19Dictionary = pickle.load(drivef)

Or load the summarized version of the papers instead:

In [None]:
#@title Select number of summarized papers that will be fed into the model { vertical-output: true, display-mode: "both" }
number_of_papers = "9000" #@param ["1000", "3000", "6000", "9000"]
import pickle

CORD19_Dataframe = r"/content/drive/My Drive/AI_4/CORD19_SentenceMap_Summarized_"+number_of_papers+".pkl"
with open(CORD19_Dataframe, 'rb') as drivef:
  CORD19Dictionary = pickle.load(drivef)

## Queries
---

In [None]:
query_list = [
  'What are the coronoviruses?',
  'What was discovered in Wuhuan in December 2019?',
  'What is Coronovirus Disease 2019?',
  'What is COVID-19?',
  'What is caused by SARS-COV2?', 'How is COVID-19 spread?',
  'Where was COVID-19 discovered?','How does coronavirus spread?'
]

proposed_answers = [
  'Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. ',
  'In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries.',
  'Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019.',
  'COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2.',
  'Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern.', 
  'First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped.',
  'In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries.',
  'The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human.'
]

myquery_list = [
  "How long can the coronavirus survive on surfaces?",
  "What means COVID-19?",
  "Is COVID19 worse than flue?",
  "When the vaccine will be ready?",
  "Whats the proteins that consist COVID-19?",
  "Whats the symptoms of COVID-19?",
  "How can I prevent COVID-19?",
  "What treatments are available for COVID-19?",
  "Is hand sanitizer effective against COVID-19?",
  "Am I at risk for serious complications from COVID-19 if I smoke cigarettes?",
  "Are there any FDA-approved drugs (medicines) for COVID-19?",
  "How are people tested?",
  "Why is the disease being called coronavirus disease 2019, COVID-19?",
  "Am I at risk for COVID-19 from mail, packages, or products?",
  "What is community spread?",
  "How can I protect myself?",
  "What is a novel coronavirus?",
  "Was Harry Potter a good magician?"
]

# Results dataframes

In [None]:
resultsDf = pd.DataFrame(columns=['Number of papers','Embeddings creation time'])

queriesDf = pd.DataFrame(columns=['Query','Proposed_answer','Model_answer','Cosine_similarity'])
queriesDf['Query'] = query_list
queriesDf['Proposed_answer'] = proposed_answers

myQueriesDf = pd.DataFrame(columns=['Query','Model_answer','Cosine_similarity'])
myQueriesDf['Query'] = myquery_list

queriesDf

| | Query | Proposed_answer | Model_answer | Cosine_similarity |
|---|---|---|---|---|
| 0 | What are the coronoviruses? | Coronaviruses (CoVs) are common human and anim... | | |
| 1 | What was discovered in Wuhuan in December 2019? | In December 2019, a novel coronavirus, called ... | | |
| 2 | What is Coronovirus Disease 2019? | Coronavirus Disease 2019 (COVID-19) is an emer... | | |
| 3 | What is COVID-19? | COVID-19 is a viral respiratory illness caused... | | |
| 4 | What is caused by SARS-COV2? | Coronavirus disease (COVID-19) is caused by SA... | | |
| 5 | How is COVID-19 spread? | First, although COVID-19 is spread by the airb... | | |
| 6 | Where was COVID-19 discovered? | In December 2019, a novel coronavirus, called ... | | |
| 7 | How does coronavirus spread? | The new coronavirus was reported to spread via... | | |


# SBERT
---

In [None]:
!pip install -U sentence-transformers

Requirement already up-to-date: sentence-transformers in /usr/local/lib/python3.6/dist-packages (0.4.1.2)


# Sentence Embeddings using Siamese BERT-Networks
---
__How does SentenceBERT work?__

Let’s start with the architecture of Sentence-BERT, which I will call SBERT from here on. SBERT is a so-called twin (Siamese) network: it processes two sentences in the same way, simultaneously. The two twins are identical down to every parameter (their weights are tied), so the architecture can be thought of as a single model applied multiple times. (I suspect the twin formulation was chosen for mathematical convenience, but I need to read the original paper to confirm this.)

As the image below shows, BERT forms the base of this model, with a pooling layer appended on top. This pooling layer produces a fixed-size representation for input sentences of varying lengths. The authors experimented with different pooling strategies: MEAN pooling, MAX pooling, or using the CLS token that BERT already generates by default. How these perform and compare will be discussed later.
Since the purpose of these fixed-size sentence embeddings is to encode the semantics of a sentence, the authors fine-tuned their network on Semantic Textual Similarity data.

<p align="center">
 <img src="https://miro.medium.com/max/700/1*LGlrGO9b5Mt1X3-ZUyalng.png" title="SBERT architecture"/> </p>
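The three pooling strategies mentioned above can be illustrated on a toy example. This is only a minimal sketch in NumPy, not the actual SBERT implementation: the function name `pool`, the toy token matrix and the padding mask are assumptions made for illustration.

```python
import numpy as np

def pool(token_embeddings, attention_mask, strategy="mean"):
    """Collapse (seq_len, dim) token embeddings into one (dim,) sentence vector."""
    if strategy == "cls":
        # By convention the [CLS] token sits at position 0.
        return token_embeddings[0]
    # Keep only real tokens; padding positions have mask value 0.
    real_tokens = token_embeddings[attention_mask.astype(bool)]
    if strategy == "mean":
        return real_tokens.mean(axis=0)
    if strategy == "max":
        return real_tokens.max(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Toy input: 4 token positions (the last one is padding), 3 dimensions.
tokens = np.array([[1.0, 0.0, 2.0],
                   [3.0, 4.0, 0.0],
                   [2.0, 2.0, 1.0],
                   [9.0, 9.0, 9.0]])  # padding row, must be ignored
mask = np.array([1, 1, 1, 0])

mean_vec = pool(tokens, mask, "mean")
max_vec = pool(tokens, mask, "max")
cls_vec = pool(tokens, mask, "cls")
```

Whatever the sentence length, each strategy returns a vector of the same dimension, which is what makes the sentence embeddings directly comparable.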


More generally, __sentence embeddings__ are a concept similar to word embeddings: they embed a full sentence into a vector space. These sentence embeddings retain some nice properties, as they inherit features from their underlying word embeddings.
There has recently been a strong trend toward universal embeddings: embeddings that are pre-trained on a large corpus and can be plugged into a variety of downstream task models (sentiment analysis, classification, translation…) to automatically improve their performance by incorporating general word/sentence representations learned on the larger dataset.
While unsupervised representation learning of sentences had been the norm for quite some time, the field has shifted toward supervised and multi-task learning schemes, with a number of very interesting proposals in late 2017/early 2018.



Creating the model:

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch
import time

embedder = SentenceTransformer('paraphrase-distilroberta-base-v1', device='cuda')
# embedder = SentenceTransformer('stsb-roberta-large', device='cuda')

# Initializing corpus

In [None]:
# The corpus is the list of sentences, i.e. the keys of the sentence dictionary
corpus = list(CORD19Dictionary.keys())

# Creating the embeddings

In [None]:
%%time
# Encoding the papers
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True,show_progress_bar=True, device='cuda')



CPU times: user 1min 45s, sys: 35.7 s, total: 2min 21s
Wall time: 2min 17s


Saving the corpus embeddings as tensors to Drive

In [None]:
corpus_embeddings_path = r"/content/drive/My Drive/AI_4/corpus_embeddings_"+number_of_papers+".pt"
torch.save(corpus_embeddings,corpus_embeddings_path)

# Loading embeddings if they have already been created and saved



---

In [None]:
corpus_embeddings_path = r"/content/drive/My Drive/AI_4/corpus_embeddings_"+number_of_papers+".pt"
with open(corpus_embeddings_path, 'rb') as f:
    corpus_embeddings = torch.load(f)

# Evaluation
---

Printing the top 3 answers for each query and storing the first (best) one in the dataframe

In [None]:
import re
from nltk import tokenize
from termcolor import colored

def paperTitle(answer, SentenceMap):
  # Each record holds the paper id at index 0 and the paper title at index 1
  record = SentenceMap[answer]
  print("Paper title:", record[1])
  print("Paper id:   ", record[0])

# Find the closest sentences of the corpus for each query based on cosine similarity.
# Relies on the notebook globals `embedder`, `corpus` and `CORD19Dictionary`.
def evaluation(top_k, query_list, corpus_embeddings, resultsDf):
  query_answers = []
  scores = []
  for query in query_list:
    query_embedding = embedder.encode(query, convert_to_tensor=True, device='cuda')

    # We use cosine-similarity and torch.topk to find the highest scores
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", colored(query, 'green'))

    for rank, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
      print("\n-> ", rank + 1)
      # Strip bracketed markers such as "[12]" from the start of each token
      answer = ' '.join([re.sub(r"^\[.*\]", "", x) for x in corpus[idx].split()])
      if rank == 0:
        # Always store the best match so both lists keep one entry per query
        scores.append(score.item())
        query_answers.append(answer)
      if len(tokenize.word_tokenize(answer)) > 1:
        print("Score: {:.4f}".format(score))
        paperTitle(corpus[idx], CORD19Dictionary)
        print("Answer size: ", len(tokenize.word_tokenize(answer)))
        print("Answer: ")
        print(colored(answer, 'yellow'))
  resultsDf['Model_answer'] = query_answers
  resultsDf['Cosine_similarity'] = scores
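To make the ranking step concrete, here is a minimal NumPy sketch of what a cosine-similarity top-k search computes. It is an illustration under assumptions, not the code used in this notebook: the helper `top_k_cosine` and the 2-dimensional toy vectors are made up, and real embeddings have hundreds of dimensions.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_matrix, k=3):
    """Return (index, score) pairs of the k corpus rows closest to the query."""
    # After L2-normalisation the dot product equals the cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_matrix / np.linalg.norm(corpus_matrix, axis=1, keepdims=True)
    scores = c @ q
    # Sort by descending score and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy corpus of four "sentence embeddings" in two dimensions.
corpus_vectors = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0],
                           [-1.0, 0.0]])
query_vector = np.array([2.0, 0.0])

ranking = top_k_cosine(query_vector, corpus_vectors, k=3)
```

Because cosine similarity ignores vector length, the query `[2, 0]` matches the corpus vector `[1, 0]` perfectly even though their norms differ.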

In [None]:
top_k = 3
evaluation(top_k,query_list,corpus_embeddings,queriesDf)





Query: What are the coronoviruses?

->  1
Score: 0.6327
Paper title: Revisiting the dangers of the coronavirus in the ophthalmology practice
Paper id:    804a9591280b7f64fa79cd3e4a9358976b084ffb
Answer size:  6
Answer: 
Coronaviruses: what are they?

->  2
Score: 0.6319
Paper title: The sialic acid binding activity of the S protein facilitates infection by porcine transmissible gastroenteritis coronavirus
Paper id:    7902723eb8b21baa5eef8703832de11cc242a43b
Answer size:  10
Answer: 
These coronaviruses can be differentiated in three distinct groups.

->  3
Score: 0.6172
Paper title: Avian viral surveillance in Victoria, Australia, and detection of two novel avian herpesviruses
Paper id:    8780e9524d9271a7b8e789f0f8c4eb6860ca8c50
Answer size:  3
Answer: 
Avian coronaviruses.




Query: What was discovered in Wuhuan in December 2019?

->  1
Score: 0.4792
Paper title: Morbidity and Mortality Weekly Report
Paper id:    11f13e2859eb22b349dbef68fd

In [None]:
top_k = 3
evaluation(top_k,myquery_list,corpus_embeddings,myQueriesDf)





Query: How long can the coronavirus survive on surfaces?

->  1
Score: 0.6419
Paper title: Transcriptional profiling of feline infectious peritonitis virus infection in CRFK cells and in PBMCs from FIP diagnosed cats
Paper id:    d4b76917de146cdbe4c922c15ac3d87d3d483446
Answer size:  9
Answer: 
The pathogenesis of feline coronavirus infection is unclear.

->  2
Score: 0.6112
Paper title: The Mongoose, the Pheasant, the Pox, and the Retrovirus
Paper id:    06c172a11ee30931ac537dfb70ce020dced50918
Answer size:  10
Answer: 
How then did this mammalian retrovirus get into birds?

->  3
Score: 0.5973
Paper title: Analysis of the codon usage pattern in Middle East Respiratory Syndrome Coronavirus
Paper id:    627ada5c21fb8d0e43b37999fa66bf41ca36c353
Answer size:  13
Answer: 
This may hint that coronavirus does not spread so widely in humans.




Query: What means COVID-19?

->  1
Score: 0.6750
Paper title: First two months of the 2019 Coronavirus Di

# Overall results

## 6000 papers with paraphrase-distilroberta-base-v1 model and no summarization 
---

### Time needed for creating the embeddings: 
- CPU times: 
  - user 17min 51s
  - sys: 11min 38s
  - total: 29min 29s
- Wall time: 29min 11s

### Remarks
Some questions have been answered, but many of the returned answers are very short sentences that should not be in the corpus at all and certainly do not answer the questions. Eliminating short sentences during pre-processing might give better results.
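A possible way to eliminate such short sentences is sketched below. This is a hedged illustration rather than the pre-processing actually used here: the helper `drop_short_sentences`, the whitespace tokenisation and the threshold `min_tokens=5` are all assumptions that would need tuning on the real corpus.

```python
def drop_short_sentences(sentences, min_tokens=5):
    """Keep only sentences with at least `min_tokens` whitespace-separated tokens."""
    return [s for s in sentences if len(s.split()) >= min_tokens]

# Sample sentences taken from the answers returned by the model.
sample = [
    "Coronavirus.",
    "Avian coronaviruses.",
    "These coronaviruses can be differentiated in three distinct groups.",
]

kept = drop_short_sentences(sample)
```

With this threshold, one-word or two-word fragments such as section titles are removed before the embeddings are created, so they can no longer be returned as the best match.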

__Top-k__ 

For several queries, the second or third answer in the top-3 fits the question better than the first (highest-scoring) answer.
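One way to quantify this observation is to judge the top-k answers by hand and compute a hit rate at k and the mean reciprocal rank. The sketch below assumes such manual judgements exist: the function names and the list `judged_ranks` are hypothetical, and the numbers are made up for illustration, not measured in this notebook.

```python
def hit_rate_at_k(first_relevant_ranks, k):
    """Fraction of queries whose first acceptable answer appears within the top k."""
    hits = sum(1 for r in first_relevant_ranks if r is not None and r <= k)
    return hits / len(first_relevant_ranks)

def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank of the first acceptable answer (0 when there is none)."""
    total = sum(1.0 / r for r in first_relevant_ranks if r is not None)
    return total / len(first_relevant_ranks)

# Hypothetical manual judgements for four queries: 1-based rank of the first
# acceptable answer, or None when no top-k answer was acceptable.
judged_ranks = [1, 3, None, 2]

hit_at_1 = hit_rate_at_k(judged_ranks, 1)
hit_at_3 = hit_rate_at_k(judged_ranks, 3)
mrr = mean_reciprocal_rank(judged_ranks)
```

A large gap between the hit rate at 1 and the hit rate at 3 would confirm that good answers are often retrieved but not ranked first.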




### Results

In [None]:
with pd.option_context('display.max_colwidth', None):
  display(queriesDf)

| | Query | Proposed_answer | Model_answer | Cosine_similarity |
|---|---|---|---|---|
| 0 | What are the coronoviruses? | Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. | Avian coronaviruses. | 0.61721 |
| 1 | What was discovered in Wuhuan in December 2019? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | Recently, in December 2019, an outbreak of unusual pneumonia caused by unknown infection was reported in Wuhan, China . | 0.541798 |
| 2 | What is Coronovirus Disease 2019? | Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019. | Coronavirus. | 0.675533 |
| 3 | What is COVID-19? | COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2. | The first three imported cases of COVID-19 in France, the first ones in Europe, were diagnosed 14 days later, on 24 January. | 0.515843 |
| 4 | What is caused by SARS-COV2? | Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern. | The SARS outbreak of 2002-2003 was caused by SARS-CoV, a novel coronavirus. | 0.615217 |
| 5 | How is COVID-19 spread? | First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped. | The international expansion of COVID-19 cases has led to widespread adoption of symptom and risk screening measures, in travel-associated and other contexts, and programs may still be adopted or expanded as source epidemics of COVID-19 emerge in new geographic areas. | 0.542352 |
| 6 | Where was COVID-19 discovered? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | No identified contact of the three cases has been confirmed with COVID-19. | 0.530918 |
| 7 | How does coronavirus spread? | The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human. | This may hint that coronavirus does not spread so widely in humans. | 0.670806 |


In [None]:
with pd.option_context('display.max_colwidth', None):
  display(myQueriesDf)

| | Query | Model_answer | Cosine_similarity |
|---|---|---|---|
| 0 | How long can the coronavirus survive on surfaces? | These data imply that the coronavirus can persist in its host for at least the duration of hibernation, particularly as nucleotide variability among the detected coronavirus isolates showed that spread of coronavirus among bats within a chamber was unlikely 32 . | 0.647764 |
| 1 | What means COVID-19? | In the context of COVID-19, we consider both growing and stable epidemic scenarios, but place greater emphasis on the realistic assumption that the COVID-19 epidemic is still growing. | 0.522382 |
| 2 | Is COVID19 worse than flue? | The first three imported cases of COVID-19 in France, the first ones in Europe, were diagnosed 14 days later, on 24 January. | 0.476099 |
| 3 | When the vaccine will be ready? | It is unknown if the vaccines will work. | 0.673893 |
| 4 | Whats the proteins that consist COVID-19? | Is there one main form or multiple for each of these proteins? | 0.583473 |
| 5 | Whats the symptoms of COVID-19? | The true fraction of subclinical COVID-19 cases remains unknown, but anecdotally, many lab-confirmed COVID-19 cases have not shown detectable symptoms on diagnosis (Hoehl et al., 2020; Nishiura et al., 2020; . | 0.630991 |
| 6 | How can I prevent COVID-19? | At the same time, there is great concern about potential public health consequences if COVID-19 spreads to developing countries that lack health infrastructure and resources to combat it effectively (de Salazar et al., 2020) . | 0.521944 |
| 7 | What treatments are available for COVID-19? | The true fraction of subclinical COVID-19 cases remains unknown, but anecdotally, many lab-confirmed COVID-19 cases have not shown detectable symptoms on diagnosis (Hoehl et al., 2020; Nishiura et al., 2020; . | 0.600732 |
| 8 | Is hand sanitizer effective against COVID-19? | Is a mandatory OPV enforced by criminal sanctions the least autonomy-infringing intervention? | 0.526254 |
| 9 | Am I at risk for serious complications from COVID-19 if I smoke cigarettes? | Does this level of RSV RNA in the air pose a risk to patients, staff or visitors in the vicinity? | 0.535259 |


## 9000 papers with paraphrase-distilroberta-base-v1 model and no summarization 
---

The session crashed after exhausting all available RAM. This happened every time I tried to run the model on all the papers without summarization.


## 6000 papers with paraphrase-distilroberta-base-v1 model and summarization 
---

### Time needed for creating the embeddings: 
- CPU times: 
  - user 1min 12s
  - sys: 25.4 s
  - total: 1min 37s
- Wall time: 1min 42s

### Remarks
The results are again not good. I also notice that many of the returned answers are themselves questions taken from the papers. As far as I understand, this happens because of the question mark. I thought of two ways of addressing this problem, but I did not have time to re-run all these notebooks (mostly due to CUDA):

- Removing the question mark during pre-processing. From what I have read, this is not a very good solution to the problem.
- Removing all sentences that end with a question mark (i.e. questions) during pre-processing. I believe this would be the best solution, but I did not have time to check it.
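The second option could look like the sketch below. It is untested on the real data and rests on assumptions: the helper `drop_question_entries` is hypothetical, the dictionary is assumed to map each sentence to a `(paper id, paper title)` record as in `CORD19Dictionary`, and the ids and titles in the sample are placeholders.

```python
def drop_question_entries(sentence_map):
    """Return a copy of the sentence dictionary without sentences ending in '?'."""
    return {
        sentence: record
        for sentence, record in sentence_map.items()
        if not sentence.rstrip().endswith("?")
    }

# Placeholder records; only the sentences come from the results in this notebook.
sample_map = {
    "Do you smoke tobacco?": ("id-1", "Title 1"),
    "And is the vaccine likely to be safe?  ": ("id-2", "Title 2"),
    "It is unknown if the vaccines will work.": ("id-3", "Title 3"),
}

filtered_map = drop_question_entries(sample_map)
```

Applying such a filter before building the corpus would prevent a question in a paper from being returned as the answer to a similar user question.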

__Top-k__

Again, many good answers appeared in second or third place rather than first.




### Results

In [None]:
with pd.option_context('display.max_colwidth', None):
  display(queriesDf)

| | Query | Proposed_answer | Model_answer | Cosine_similarity |
|---|---|---|---|---|
| 0 | What are the coronoviruses? | Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. | Avian coronaviruses. | 0.61721 |
| 1 | What was discovered in Wuhuan in December 2019? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | Recently, in December 2019, an outbreak of unusual pneumonia caused by unknown infection was reported in Wuhan, China . | 0.541798 |
| 2 | What is Coronovirus Disease 2019? | Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019. | Coronavirus. | 0.675533 |
| 3 | What is COVID-19? | COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2. | The first three imported cases of COVID-19 in France, the first ones in Europe, were diagnosed 14 days later, on 24 January. | 0.515843 |
| 4 | What is caused by SARS-COV2? | Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern. | The SARS outbreak of 2002-2003 was caused by SARS-CoV, a novel coronavirus. | 0.615217 |
| 5 | How is COVID-19 spread? | First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped. | The international expansion of COVID-19 cases has led to widespread adoption of symptom and risk screening measures, in travel-associated and other contexts, and programs may still be adopted or expanded as source epidemics of COVID-19 emerge in new geographic areas. | 0.542352 |
| 6 | Where was COVID-19 discovered? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | No identified contact of the three cases has been confirmed with COVID-19. | 0.530918 |
| 7 | How does coronavirus spread? | The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human. | This may hint that coronavirus does not spread so widely in humans. | 0.670806 |


In [None]:
with pd.option_context('display.max_colwidth', None):
  display(myQueriesDf)

| | Query | Model_answer | Cosine_similarity |
|---|---|---|---|
| 0 | How long can the coronavirus survive on surfaces? | This may hint that coronavirus does not spread so widely in humans. | 0.59728 |
| 1 | What means COVID-19? | established for HCoV-OC43 ns12.9 (25), HCoV-229E 4a (24) , and PEDV 3 (26) . | 0.465204 |
| 2 | Is COVID19 worse than flue? | So why do Zika and Dengue NS2B have such radical differences in conformations and dynamics? | 0.443893 |
| 3 | When the vaccine will be ready? | And is the vaccine likely to be safe? | 0.742244 |
| 4 | Whats the proteins that consist COVID-19? | Is there one main form or multiple for each of these proteins? | 0.583473 |
| 5 | Whats the symptoms of COVID-19? | What is the clinical profile of co-infected patients? | 0.576397 |
| 6 | How can I prevent COVID-19? | So how can the COPII machinery manage to transport large viral agglomerates? | 0.480076 |
| 7 | What treatments are available for COVID-19? | What is the clinical profile of co-infected patients? | 0.502994 |
| 8 | Is hand sanitizer effective against COVID-19? | the effectiveness of handwashing and alcohol-based hand sanitizers. | 0.52635 |
| 9 | Am I at risk for serious complications from COVID-19 if I smoke cigarettes? | Do you smoke tobacco? | 0.633407 |


## 9000 papers with paraphrase-distilroberta-base-v1 model and summarization 
---


### Time needed for creating the embeddings: 
- CPU times: 
  - user 1min 45s
  - sys: 35.7 s
  - total: 2min 21s
- Wall time: 2min 17s

### Remarks
Despite enlarging the (summarized) corpus by 3000 papers, the results are worse than before. The same problem persists: the model answers with questions or with irrelevant sentences. I cannot tell exactly what is going wrong, as this is the third time I have fixed the pre-processing and re-run my models.



### Results

In [None]:
with pd.option_context('display.max_colwidth', None):
  display(queriesDf)

| | Query | Proposed_answer | Model_answer | Cosine_similarity |
|---|---|---|---|---|
| 0 | What are the coronoviruses? | Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. | Coronaviruses: what are they? | 0.63273 |
| 1 | What was discovered in Wuhuan in December 2019? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | Nine of the first 11 U.S. 2019-nCoV patients were exposed in Wuhan, China. | 0.479242 |
| 2 | What is Coronovirus Disease 2019? | Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019. | Coronavirus. | 0.675533 |
| 3 | What is COVID-19? | COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2. | This is particularly true for the COVID-19. | 0.665808 |
| 4 | What is caused by SARS-COV2? | Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern. | Is there an effective specific anti-SARS-CoV-2 solution? | 0.661237 |
| 5 | How is COVID-19 spread? | First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped. | The COVID-19 has then rapidly spread to all over China and the world. | 0.625521 |
| 6 | Where was COVID-19 discovered? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | The main dataset of this study is COVID-19 dataset. | 0.595511 |
| 7 | How does coronavirus spread? | The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human. | The pathogenesis of feline coronavirus infection is unclear. | 0.679283 |


In [None]:
with pd.option_context('display.max_colwidth', None):
  display(myQueriesDf)

| | Query | Model_answer | Cosine_similarity |
|---|---|---|---|
| 0 | How long can the coronavirus survive on surfaces? | The pathogenesis of feline coronavirus infection is unclear. | 0.641915 |
| 1 | What means COVID-19? | This is particularly true for the COVID-19. | 0.675011 |
| 2 | Is COVID19 worse than flue? | This is particularly true for the COVID-19. | 0.520135 |
| 3 | When the vaccine will be ready? | And is the vaccine likely to be safe? | 0.742244 |
| 4 | Whats the proteins that consist COVID-19? | Is there one main form or multiple for each of these proteins? | 0.583473 |
| 5 | Whats the symptoms of COVID-19? | What is the percentage of COVID-19 patients have been infected with SARS and produced antibodies? | 0.636812 |
| 6 | How can I prevent COVID-19? | reported that people who have not been exposed to SARS-CoV-2 are all susceptible to COVID-19 . | 0.580536 |
| 7 | What treatments are available for COVID-19? | It remains to be seen if this will be the case for COVID-19 as well. | 0.591324 |
| 8 | Is hand sanitizer effective against COVID-19? | the effectiveness of handwashing and alcohol-based hand sanitizers. | 0.52635 |
| 9 | Am I at risk for serious complications from COVID-19 if I smoke cigarettes? | Do you smoke tobacco? | 0.633407 |


# References

[1] https://www.sbert.net/examples/applications/semantic-search/README.html

[2] https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing