<p align="center">
 <img src="http://www.di.uoa.gr/themes/corporate_lite/logo_el.png" title="Department of Informatics and Telecommunications - University of Athens"/> </p>

---
<h1 align="center"> 
  Artificial Intelligence
</h1>
<h1 align="center" > 
  Deep Learning for Natural Language Processing
</h1>

---
<h2 align="center"> 
 <b>Konstantinos Nikoletos</b>
</h2>

<h3 align="center"> 
 <b>Winter 2020-2021</b>
</h3>


---
---



### __Task__ 
This exercise is about developing a document retrieval system to return titles of scientific
papers containing the answer to a given user question. You will use the first version of
the COVID-19 Open Research Dataset (CORD-19) in your work (articles in the folder
`comm_use_subset`).


For example, for the question “What are the coronaviruses?”, your system can return the
paper title “Distinct Roles for Sialoside and Protein Receptors in Coronavirus Infection”
since this paper contains the answer to the asked question.


To achieve the goal of this exercise, you will first need to read the paper Sentence-BERT:
Sentence Embeddings using Siamese BERT-Networks, in order to understand how you
can create sentence embeddings. In the related work of this paper, you will also find other
approaches for developing your model. For example, you can use GloVe embeddings,
etc. In this link, you can find the extended versions of this dataset to test your model, if
you want. You are required to:


<ol type="a">
  <li>Preprocess the provided dataset. You will decide which data of each paper is useful
to your model in order to create the appropriate embeddings. You need to explain
your decisions.</li>
  <li>Implement at least 2 different sentence embedding approaches (see the related work
of the Sentence-BERT paper), in order for your model to retrieve the titles of the
papers related to a given question.</li>
  <li>Compare your 2 models based on at least 2 different criteria of your choice. Explain
why you selected these criteria, your implementation choices, and the results. Some
questions you can pose are included here. You will need to provide the extra questions
you posed to your model and the results of all the questions as well.</li>
</ol>

### __Notebook__ 


In this notebook I create the sentence embeddings using Doc2Vec.


---
---

__Import__ of essential libraries


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import sys # only needed to determine Python version number
import matplotlib # only needed to determine Matplotlib version 
import nltk
from nltk.stem import WordNetLemmatizer
import pprint
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext import data
import logging
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

Selecting device (GPU - CUDA if available)

In [None]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


# Loading data
---

In [None]:
# Opening data file
import io
from google.colab import drive
from os import listdir
from os.path import isfile, join
import json

drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


Loading the dictionary if it has been created

In [None]:
#@title Select number of papers that will be fed into the model { vertical-output: true, display-mode: "both" }
number_of_papers = "9000" #@param ["1000", "3000", "6000", "9000"]
import pickle

CORD19_Dataframe = r"/content/drive/My Drive/AI_4/CORD19_SentenceMap_"+number_of_papers+".pkl"
with open(CORD19_Dataframe, 'rb') as drivef:
  CORD19Dictionary = pickle.load(drivef)

Or load the summarized version of the papers instead

In [None]:
#@title Select number of summarized papers that will be fed into the model { vertical-output: true, display-mode: "both" }
number_of_papers = "9000" #@param ["1000", "3000", "6000", "9000"]
import pickle

CORD19_Dataframe = r"/content/drive/My Drive/AI_4/CORD19_SentenceMap_Summarized_"+number_of_papers+".pkl"
with open(CORD19_Dataframe, 'rb') as drivef:
  CORD19Dictionary = pickle.load(drivef)
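
Both pickles are assumed to hold a dictionary that maps each sentence to a record of (paper id, paper title), since that is how `paperTitle` reads it later in the notebook. A minimal sketch of that structure and of the pickle round trip, using an in-memory buffer and made-up ids and titles (everything named `example_*` or `paper-00x` is hypothetical):

```python
import io
import pickle

# Hypothetical miniature of the sentence map: sentence -> (paper id, paper title)
example_map = {
    "RNA was stored at 80 C until use.": ("paper-001", "Example title A"),
    "The medium was replenished every 24 h.": ("paper-002", "Example title B"),
}

# Round trip through pickle, using an in-memory buffer instead of a file on Drive
buffer = io.BytesIO()
pickle.dump(example_map, buffer)
buffer.seek(0)
example_loaded = pickle.load(buffer)

# The corpus is the list of sentences; each one leads back to its paper
example_corpus = list(example_loaded.keys())
example_id, example_title = example_loaded[example_corpus[0]]
print(example_id, "|", example_title)
```

Keeping the sentence itself as the key makes the lookup from a retrieved sentence to its paper title a single dictionary access.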

## Queries
---

In [None]:
query_list = [
  'What are the coronoviruses?',
  'What was discovered in Wuhuan in December 2019?',
  'What is Coronovirus Disease 2019?',
  'What is COVID-19?',
  'What is caused by SARS-COV2?', 'How is COVID-19 spread?',
  'Where was COVID-19 discovered?','How does coronavirus spread?'
]

proposed_answers = [
  'Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. ',
  'In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries.',
  'Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019.',
  'COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2.',
  'Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern.', 
  'First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped.',
  'In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries.',
  'The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human.'
]

myquery_list = [
  "How long can the coronavirus survive on surfaces?",
  "What means COVID-19?",
  "Is COVID19 worse than flue?",
  "When the vaccine will be ready?",
  "Whats the proteins that consist COVID-19?",
  "Whats the symptoms of COVID-19?",
  "How can I prevent COVID-19?",
  "What treatments are available for COVID-19?",
  "Is hand sanitizer effective against COVID-19?",
  "Am I at risk for serious complications from COVID-19 if I smoke cigarettes?",
  "Are there any FDA-approved drugs (medicines) for COVID-19?",
  "How are people tested?",
  "Why is the disease being called coronavirus disease 2019, COVID-19?",
  "Am I at risk for COVID-19 from mail, packages, or products?",
  "What is community spread?",
  "How can I protect myself?",
  "What is a novel coronavirus?",
  "Was Harry Potter a good magician?"
]

# Results dataframes

In [None]:
resultsDf = pd.DataFrame(columns=['Number of papers','Embeddings creation time'])

queriesDf = pd.DataFrame(columns=['Query','Proposed_answer','Model_answer','Cosine_similarity'])
queriesDf['Query'] = query_list
queriesDf['Proposed_answer'] = proposed_answers

myQueriesDf = pd.DataFrame(columns=['Query','Model_answer','Cosine_similarity'])
myQueriesDf['Query'] = myquery_list

queriesDf

| | Query | Proposed_answer | Model_answer | Cosine_similarity |
|---|---|---|---|---|
| 0 | What are the coronoviruses? | Coronaviruses (CoVs) are common human and anim... | | |
| 1 | What was discovered in Wuhuan in December 2019? | In December 2019, a novel coronavirus, called ... | | |
| 2 | What is Coronovirus Disease 2019? | Coronavirus Disease 2019 (COVID-19) is an emer... | | |
| 3 | What is COVID-19? | COVID-19 is a viral respiratory illness caused... | | |
| 4 | What is caused by SARS-COV2? | Coronavirus disease (COVID-19) is caused by SA... | | |
| 5 | How is COVID-19 spread? | First, although COVID-19 is spread by the airb... | | |
| 6 | Where was COVID-19 discovered? | In December 2019, a novel coronavirus, called ... | | |
| 7 | How does coronavirus spread? | The new coronavirus was reported to spread via... | | |


# Doc2Vec
---

Doc2Vec is an extension of Word2Vec and one of the most popular techniques for embedding sentences and documents. Introduced in 2014, it is an unsupervised algorithm that extends the Word2Vec model with an additional ‘paragraph vector’. There are two ways to add the paragraph vector to the model.

1. PV-DM (Distributed Memory version of Paragraph Vector): We assign a paragraph vector to each sentence, while the word vectors are shared among all sentences. We then either average or concatenate the paragraph vector and the word vectors of the context to predict the next word. It is an extension of the Continuous Bag-of-Words type of Word2Vec, where a word is predicted from its surrounding words. In PV-DM the paragraph vector is simply one more input to that prediction, and after training it serves as the sentence representation.

<p align="center">
 <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/08/PVDM.png" title="Department of Informatics and Telecommunications - University of Athens"/>
 <p  align="center">PV-DM architecture</p>
</p>

2. PV-DBOW (Distributed Bag of Words version of Paragraph Vector): Just like PV-DM, PV-DBOW is another extension, this time of the Skip-gram type. Here, we sample random words from the sentence and make the model predict them from the sentence's paragraph vector alone (a classification task).

<p align="center">
 <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/08/PVDOBW.png" title="Department of Informatics and Telecommunications - University of Athens"/>
 <p  align="center">PV-DBOW architecture</p>
</p>

The authors of the paper recommend using both in combination, but state that PV-DM alone usually works well for most tasks.
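
To make the difference between the two variants concrete, here is a minimal sketch in plain Python of the training examples each one would build from a single sentence. It is an illustration only, not gensim's implementation: the helper names `pv_dm_examples` and `pv_dbow_examples` are hypothetical, the PV-DM context is assumed to be the preceding `window` words, and PV-DBOW is shown with every word listed instead of a random sample.

```python
def pv_dm_examples(doc_id, tokens, window=2):
    """PV-DM: predict a word from the paragraph id plus the preceding words."""
    examples = []
    for i in range(window, len(tokens)):
        context = tuple(tokens[i - window:i])
        examples.append(((doc_id, context), tokens[i]))
    return examples


def pv_dbow_examples(doc_id, tokens):
    """PV-DBOW: predict words of the paragraph from the paragraph id alone."""
    return [(doc_id, word) for word in tokens]


tokens = "the virus binds to the receptor".split()
dm = pv_dm_examples(0, tokens, window=2)
dbow = pv_dbow_examples(0, tokens)

print(dm[0])    # ((0, ('the', 'virus')), 'binds')
print(dbow[0])  # (0, 'the')
```

In both cases the paragraph id is the only part of the input that is specific to the sentence, which is why its learned vector can be used as the sentence embedding.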

In [None]:
# from sentence_transformers import SentenceTransformer, util
import torch
import time
from nltk import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Initializing corpus

In [None]:
# The corpus is the list of sentences (the keys of the sentence map)
corpus = list(CORD19Dictionary.keys())

# Creating the embeddings


```
vector_size = Dimensionality of the feature vectors.
window = The maximum distance between the current and predicted word within a sentence.
min_count = Ignores all words with total frequency lower than this.
alpha = The initial learning rate.
```
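
To make `min_count` concrete, here is a minimal sketch in plain Python of the vocabulary pruning it describes. This is not gensim's code; `build_vocab` is a hypothetical helper and the two tiny documents are made up for the example.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=1):
    """Keep only the words whose total frequency is at least min_count."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok for tok, count in counts.items() if count >= min_count}


docs = [["covid", "spread"], ["covid", "vaccine"]]

print(sorted(build_vocab(docs, min_count=1)))  # ['covid', 'spread', 'vaccine']
print(sorted(build_vocab(docs, min_count=2)))  # ['covid']
```

With `min_count = 1`, as used in this notebook, no word is dropped, so rare tokens such as identifiers and citation fragments stay in the vocabulary.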

In [None]:
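# TaggedDocument expects a list of tokens, so tokenize each sentence the same way as the queries
tagged_data = [TaggedDocument(word_tokenize(d.lower()), [i]) for i, d in enumerate(corpus)]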
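
`TaggedDocument` takes the words of a document as its first argument, and the model iterates over that argument to collect its vocabulary. A minimal sketch of why a raw string and a token list behave differently when iterated, using a hypothetical namedtuple stand-in (`TaggedDoc`) so that it runs without gensim, and a plain `split()` as a simplified tokenizer:

```python
from collections import namedtuple

# Hypothetical stand-in with the same two fields as gensim's TaggedDocument
TaggedDoc = namedtuple("TaggedDoc", "words tags")

sentence = "RNA was stored at 80 C until use."

as_string = TaggedDoc(sentence, [0])                  # words is a raw string
as_tokens = TaggedDoc(sentence.lower().split(), [0])  # words is a list of tokens

# Iterating a string yields single characters, not words
print(list(as_string.words)[:5])  # ['R', 'N', 'A', ' ', 'w']
print(as_tokens.words)
```

With a raw string, the "vocabulary" is the set of characters, so it is worth checking that `words` is a list of tokens before training.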

In [None]:
%%time
model = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1)

CPU times: user 2min 44s, sys: 21.9 s, total: 3min 5s
Wall time: 2min 21s


# Evaluation
---


In [None]:
import re
from  nltk import tokenize
from termcolor import colored

def paperTitle(answer,SentenceMap):
  record = SentenceMap[answer]
  print("Paper title:",record[1])
  print("Paper id:   ",record[0])  

# Find the closest sentence of the corpus for each query sentence based on cosine similarity
def evaluation(topk,query_list,model,resultsDf):
  query_answers = []
  scores = []
  for query in query_list:

      test_doc = word_tokenize(query.lower())
      test_doc_vector = model.infer_vector(test_doc)


      # Ask for exactly topk neighbours (the default returns only the 10 most similar)
      most_similar = model.docvecs.most_similar(positive = [test_doc_vector], topn = topk)

      print("\n\n======================\n\n")
      print("Query:",colored(query,'green') )

      for k, (idx, score) in zip(range(0,topk),most_similar):
        print("\n-> ",k+1)
        answer = ' '.join([re.sub(r"^\[.*\]", "", x) for x in corpus[idx].split()])
        print("Score: {:.4f}".format(score))
        paperTitle(corpus[idx],CORD19Dictionary)
        print("Answer size: ",len(tokenize.word_tokenize(answer)))
        print("Answer: ")
        print(colored(answer,'yellow'))
        if(k == 0):
          scores.append(score)
          query_answers.append(answer)
  resultsDf['Model_answer'] = query_answers
  resultsDf['Cosine_similarity'] = scores
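
The ranking step inside `evaluation` orders sentences by the cosine similarity between the inferred query vector and each sentence vector. A minimal sketch of that ranking in plain Python, on made-up two-dimensional vectors; `cosine` and `rank_by_cosine` are hypothetical helpers, not part of gensim:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs, k=3):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(idx, cosine(query_vec, vec)) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = rank_by_cosine([1.0, 0.1], doc_vecs, k=2)
print([idx for idx, _ in ranking])  # [0, 2]
```

Cosine similarity ignores vector length, so a high score only says that two vectors point in a similar direction, not that the sentences answer the query.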

In [None]:
top_k = 3
evaluation(top_k,query_list,model,queriesDf)





Query: [32mWhat are the coronoviruses?[0m

->  1
Score: 0.7967
Paper title: Bacillus subtilis and surfactin inhibit the transmissible gastroenteritis virus from entering the intestinal epithelial cells
Paper id:    603e11d248463c9b79a1030663946c5640810536
Anser size:  19
Anser: 
[33mBinding to the cellular receptor is the first step of CoV entry process [22, 23] .[0m

->  2
Score: 0.7912
Paper title: 
Paper id:    b3574701bdaf6c4d408e9abe2d4176e058fc5208
Anser size:  6
Anser: 
[33m2009; Peel et al.[0m

->  3
Score: 0.7907
Paper title: Climate Change Could Increase the Geographic Extent of Hendra Virus Spillover Risk
Paper id:    62df738127972a011de37d5c27776c48442889a9
Anser size:  6
Anser: 
[33m2008; Wiens et al.[0m




Query: [32mWhat was discovered in Wuhuan in December 2019?[0m

->  1
Score: 0.8092
Paper title: Climate Change Could Increase the Geographic Extent of Hendra Virus Spillover Risk
Paper id:    62df738127972a011de37d5c27776c48442889a9
Anser size:  6
Anser: 

In [None]:
top_k = 3
evaluation(top_k,myquery_list,model,myQueriesDf)





Query: [32mHow long can the coronavirus survive on surfaces?[0m

->  1
Score: 0.8004
Paper title: 
Paper id:    b3574701bdaf6c4d408e9abe2d4176e058fc5208
Anser size:  6
Anser: 
[33m2009; Peel et al.[0m

->  2
Score: 0.7923
Paper title: Zoonotic parasites of dromedary camels: so important, so ignored
Paper id:    6c4fd19f82c72d0d52f0de8c0290f522646c6da7
Anser size:  12
Anser: 
[33mCE causes considerable medical costs and economic losses in endemic areas.[0m

->  3
Score: 0.7862
Paper title: Climate Change Could Increase the Geographic Extent of Hendra Virus Spillover Risk
Paper id:    62df738127972a011de37d5c27776c48442889a9
Anser size:  6
Anser: 
[33m2008; Wiens et al.[0m




Query: [32mWhat means COVID-19?[0m

->  1
Score: 0.7932
Paper title: Contact chains of cattle farms in Great Britain
Paper id:    e1d7e7883ad7d77a1c982c700f77e5ce09530fcf
Anser size:  10
Anser: 
[33m-We have clarified in the methods at lines 180-191.[0m

->  2
Score: 0.7803
Paper title: Potential Ra

# Overall results

## 9000 papers with no summarization 
---

### Time needed for creating the embeddings: 
- CPU times: 
  - user 25min 13s
  - sys: 2min 35s
  - total: 27min 48s
- Wall time: 19min 20s

### Remarks
The results are terrible. I believe I am doing something wrong in the way I use these embeddings. This needs to be checked again, but I did not have the time.



### Results

In [None]:
with pd.option_context('display.max_colwidth', None):
  display(queriesDf)

| | Query | Proposed_answer | Model_answer | Cosine_similarity |
|---|---|---|---|---|
| 0 | What are the coronoviruses? | Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. | 5 In these constructs, the CD8 + HLA-A03-11 supertypes-restricted epitopes were linked by N/KAAA spacers. | 0.879537 |
| 1 | What was discovered in Wuhuan in December 2019? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | Nineteen 2-ml fractions were collected from the bottom of the cushion. | 0.806609 |
| 2 | What is Coronovirus Disease 2019? | Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019. | Pathway classification according to canonical pathways was performed using IPA software. | 0.834787 |
| 3 | What is COVID-19? | COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2. | Two Zn 2+ binding sites have been identified in NS5 RdRP crystal structures. | 0.748369 |
| 4 | What is caused by SARS-COV2? | Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern. | Nineteen 2-ml fractions were collected from the bottom of the cushion. | 0.830321 |
| 5 | How is COVID-19 spread? | First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped. | Pathway classification according to canonical pathways was performed using IPA software. | 0.833483 |
| 6 | Where was COVID-19 discovered? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | Nineteen 2-ml fractions were collected from the bottom of the cushion. | 0.825443 |
| 7 | How does coronavirus spread? | The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human. | Two Zn 2+ binding sites have been identified in NS5 RdRP crystal structures. | 0.755793 |


In [None]:
with pd.option_context('display.max_colwidth', None):
  display(myQueriesDf)

| | Query | Model_anser | Cosine_similarity | Model_answer |
|---|---|---|---|---|
| 0 | How long can the coronavirus survive on surfaces? | | 0.814669 | Nineteen 2-ml fractions were collected from the bottom of the cushion. |
| 1 | What means COVID-19? | | 0.838306 | Pathway classification according to canonical pathways was performed using IPA software. |
| 2 | Is COVID19 worse than flue? | | 0.73619 | During ascent, an off set appeared for all respiratory parameters: Vt increased by 59% and PmEI by 53% between 0 and 8,000 ft. During descent, the off set was reversely directed with a 39% decrease in Vt and a 28% decrease in PmEE between 8,000 and 0 ft. Modifying working pressure adequately corrected PmEI and PmEE, but not Vt. |
| 3 | When the vaccine will be ready? | | 0.831917 | Pathway classification according to canonical pathways was performed using IPA software. |
| 4 | Whats the proteins that consist COVID-19? | | 0.842919 | Nineteen 2-ml fractions were collected from the bottom of the cushion. |
| 5 | Whats the symptoms of COVID-19? | | 0.743365 | During ascent, an off set appeared for all respiratory parameters: Vt increased by 59% and PmEI by 53% between 0 and 8,000 ft. During descent, the off set was reversely directed with a 39% decrease in Vt and a 28% decrease in PmEE between 8,000 and 0 ft. Modifying working pressure adequately corrected PmEI and PmEE, but not Vt. |
| 6 | How can I prevent COVID-19? | | 0.758356 | Two Zn 2+ binding sites have been identified in NS5 RdRP crystal structures. |
| 7 | What treatments are available for COVID-19? | | 0.825538 | Pathway classification according to canonical pathways was performed using IPA software. |
| 8 | Is hand sanitizer effective against COVID-19? | | 0.841182 | Pathway classification according to canonical pathways was performed using IPA software. |
| 9 | Am I at risk for serious complications from COVID-19 if I smoke cigarettes? | | 0.822572 | Pathway classification according to canonical pathways was performed using IPA software. |


## 9000 papers with summarization
---

### Time needed for creating the embeddings: 
- CPU times: 
  - user 17min 51s
  - sys: 11min 38s
  - total: 29min 29s
- Wall time: 29min 11s

### Remarks

Same as above.



### Results

In [None]:
with pd.option_context('display.max_colwidth', None):
  display(queriesDf)

| | Query | Proposed_answer | Model_answer | Cosine_similarity |
|---|---|---|---|---|
| 0 | What are the coronoviruses? | Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. | Binding to the cellular receptor is the first step of CoV entry process [22, 23] . | 0.796678 |
| 1 | What was discovered in Wuhuan in December 2019? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | 2008; Wiens et al. | 0.809173 |
| 2 | What is Coronovirus Disease 2019? | Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019. | The medium was replenished every 24 h. | 0.837877 |
| 3 | What is COVID-19? | COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2. | RNA was stored at 80 C until use. | 0.773054 |
| 4 | What is caused by SARS-COV2? | Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern. | 2008; Wiens et al. | 0.7959 |
| 5 | How is COVID-19 spread? | First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped. | 2008; Wiens et al. | 0.789324 |
| 6 | Where was COVID-19 discovered? | In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries. | VIRSorter_NODE_25_length_14198_cov_12_3353-cat_2 | 0.838882 |
| 7 | How does coronavirus spread? | The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human. | 2008; Wiens et al. | 0.8099 |


In [None]:
with pd.option_context('display.max_colwidth', None):
  display(myQueriesDf)

| | Query | Model_answer | Cosine_similarity |
|---|---|---|---|
| 0 | How long can the coronavirus survive on surfaces? | 2009; Peel et al. | 0.800417 |
| 1 | What means COVID-19? | -We have clarified in the methods at lines 180-191. | 0.793196 |
| 2 | Is COVID19 worse than flue? | 2008; Wiens et al. | 0.799016 |
| 3 | When the vaccine will be ready? | The medium was replenished every 24 h. | 0.833728 |
| 4 | Whats the proteins that consist COVID-19? | 2008; Wiens et al. | 0.779292 |
| 5 | Whats the symptoms of COVID-19? | 2008; Wiens et al. | 0.809461 |
| 6 | How can I prevent COVID-19? | 2008; Wiens et al. | 0.811007 |
| 7 | What treatments are available for COVID-19? | 2008; Wiens et al. | 0.761357 |
| 8 | Is hand sanitizer effective against COVID-19? | 2008; Wiens et al. | 0.810513 |
| 9 | Am I at risk for serious complications from COVID-19 if I smoke cigarettes? | 2009; Peel et al. | 0.824034 |
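
One quick way to quantify how degenerate the retrieval is: count how often the same sentence comes back as the top answer. A minimal sketch over the ten top answers listed in the table above, copied by hand; the variable names are illustrative only.

```python
from collections import Counter

# Top-1 answers for the ten extra queries shown in the table above
top_answers = [
    "2009; Peel et al.",
    "-We have clarified in the methods at lines 180-191.",
    "2008; Wiens et al.",
    "The medium was replenished every 24 h.",
    "2008; Wiens et al.",
    "2008; Wiens et al.",
    "2008; Wiens et al.",
    "2008; Wiens et al.",
    "2008; Wiens et al.",
    "2009; Peel et al.",
]

answer_counts = Counter(top_answers)
distinct_ratio = len(answer_counts) / len(top_answers)

print(answer_counts.most_common(1))  # [('2008; Wiens et al.', 6)]
print(distinct_ratio)                # 0.4
```

A single citation fragment answering six of ten unrelated queries suggests that the scores do not depend much on the query content.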

