<p align="center">
 <img src="http://www.di.uoa.gr/themes/corporate_lite/logo_el.png" title="Department of Informatics and Telecommunications - University of Athens"/> </p>

---
<h1 align="center"> 
  Artificial Intelligence
</h1>
<h1 align="center" > 
  Deep Learning for Natural Language Processing
</h1>

---
<h2 align="center"> 
 <b>Konstantinos Nikoletos</b>
</h2>

<h3 align="center"> 
 <b>Winter 2020-2021</b>
</h3>


---
---


### __Task__ 
This exercise is about developing a document retrieval system to return titles of scientific
papers containing the answer to a given user question. You will use the first version of
the COVID-19 Open Research Dataset (CORD-19) in your work (articles in the folder
comm use subset).


For example, for the question “What are the coronaviruses?”, your system can return the
paper title “Distinct Roles for Sialoside and Protein Receptors in Coronavirus Infection”
since this paper contains the answer to the asked question.


To achieve the goal of this exercise, you will need first to read the paper Sentence-BERT:
Sentence Embeddings using Siamese BERT-Networks, in order to understand how you
can create sentence embeddings. In the related work of this paper, you will also find other
approaches for developing your model. For example, you can using Glove embeddings,
etc. In this link, you can find the extended versions of this dataset to test your model, if
you want. You are required to:


<ol type="a">
  <li>Preprocess the provided dataset. You will decide which data of each paper is useful
to your model in order to create the appropriate embeddings. You need to explain
your decisions.</li>
  <li>Implement at least 2 different sentence embedding approaches (see the related work
of the Sentence-BERT paper), in order for your model to retrieve the titles of the
papers related to a given question.</li>
  <li>Compare your 2 models based on at least 2 different criteria of your choice. Explain
why you selected these criteria, your implementation choices, and the results. Some
questions you can pose are included here. You will need to provide the extra questions
you posed to your model and the results of all the questions as well.</li>
</ol>

### __Notebook__

In this notebook I tried to create embeddings using InferSent but I got stucked to some errors that I finally didn't have the time to fix. I added this notebook it to the zip I handed as I made a try (unsuccesful) for InferSent embeddings. 



## NOT WORKING!!!

---
---

__Import__ of essential libraries


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import sys # only needed to determine Python version number
import matplotlib # only needed to determine Matplotlib version 
import nltk
from nltk.stem import WordNetLemmatizer
import pprint
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext import data
import logging
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

Selecting device (GPU - CUDA if available)

In [2]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


# Loading data
---

In [4]:
# Opening data file
import io
from google.colab import drive
from os import listdir
from os.path import isfile, join
import json

drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


Loading the dictionary if it has been created

In [5]:
#@title Select number of papers that will be feeded in the model { vertical-output: true, display-mode: "both" }
number_of_papers = "1000" #@param ["1000","3000","6000"]
import pickle

CORD19_Dataframe = r"/content/drive/My Drive/AI_4/CORD19_SentenceMap_"+number_of_papers+".pkl"
with open(CORD19_Dataframe, 'rb') as drivef:
  CORD19Dictionary = pickle.load(drivef)

## Queries
---

In [7]:
query_list = [
  'What are the coronoviruses?',
  'What was discovered in Wuhuan in December 2019?',
  'What is Coronovirus Disease 2019?',
  'What is COVID-19?',
  'What is caused by SARS-COV2?', 'How is COVID-19 spread?',
  'Where was COVID-19 discovered?','How does coronavirus spread?'
]

proposed_answers = [
  'Coronaviruses (CoVs) are common human and animal pathogens that can transmit zoonotically and cause severe respiratory disease syndromes. ',
  'In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries.',
  'Coronavirus Disease 2019 (COVID-19) is an emerging disease with a rapid increase in cases and deaths since its first identification in Wuhan, China, in December 2019.',
  'COVID-19 is a viral respiratory illness caused by a new coronavirus called SARS-CoV-2.',
  'Coronavirus disease (COVID-19) is caused by SARS-COV2 and represents the causative agent of a potentially fatal disease that is of great global public health concern.', 
  'First, although COVID-19 is spread by the airborne route, air disinfection of cities and communities is not known to be effective for disease control and needs to be stopped.',
  'In December 2019, a novel coronavirus, called COVID-19, was discovered in Wuhan, China, and has spread to different cities in China as well as to 24 other countries.',
  'The new coronavirus was reported to spread via droplets, contact and natural aerosols from human-to-human.'
]

myquery_list = [
  "How long can the coronavirus survive on surfaces?",
  "What means COVID-19?",
  "Is COVID19 worse than flue?",
  "When the vaccine will be ready?",
  "Whats the proteins that consist COVID-19?",
  "Whats the symptoms of COVID-19?",
  "How can I prevent COVID-19?",
  "What treatments are available for COVID-19?",
  "Is hand sanitizer effective against COVID-19?",
  "Am I at risk for serious complications from COVID-19 if I smoke cigarettes?",
  "Are there any FDA-approved drugs (medicines) for COVID-19?",
  "How are people tested?",
  "Why is the disease being called coronavirus disease 2019, COVID-19?",
  "Am I at risk for COVID-19 from mail, packages, or products?",
  "What is community spread?",
  "How can I protect myself?",
  "What is a novel coronavirus?",
  "Was Harry Potter a good magician?"
]

# Results dataframes

In [8]:
resultsDf = pd.DataFrame(columns=['Number of papers','Embeddings creation time'])

queriesDf = pd.DataFrame(columns=['Query','Proposed_answer','Model_answer','Cosine_similarity'])
queriesDf['Query'] = query_list
queriesDf['Proposed_answer'] = proposed_answers

myQueriesDf = pd.DataFrame(columns=['Query','Model_anser','Cosine_similarity'])
myQueriesDf['Query'] = myquery_list

queriesDf

Unnamed: 0,Query,Proposed_answer,Model_answer,Cosine_similarity
0,What are the coronoviruses?,Coronaviruses (CoVs) are common human and anim...,,
1,What was discovered in Wuhuan in December 2019?,"In December 2019, a novel coronavirus, called ...",,
2,What is Coronovirus Disease 2019?,Coronavirus Disease 2019 (COVID-19) is an emer...,,
3,What is COVID-19?,COVID-19 is a viral respiratory illness caused...,,
4,What is caused by SARS-COV2?,Coronavirus disease (COVID-19) is caused by SA...,,
5,How is COVID-19 spread?,"First, although COVID-19 is spread by the airb...",,
6,Where was COVID-19 discovered?,"In December 2019, a novel coronavirus, called ...",,
7,How does coronavirus spread?,The new coronavirus was reported to spread via...,,


# InferSent
---

In [9]:
!pip install -U sentence-transformers

Collecting sentence-transformers
[?25l  Downloading https://files.pythonhosted.org/packages/6a/e2/84d6acfcee2d83164149778a33b6bdd1a74e1bcb59b2b2cd1b861359b339/sentence-transformers-0.4.1.2.tar.gz (64kB)
[K     |████████████████████████████████| 71kB 7.8MB/s 
[?25hCollecting transformers<5.0.0,>=3.1.0
[?25l  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
[K     |████████████████████████████████| 1.8MB 23.9MB/s 
Collecting sentencepiece
[?25l  Downloading https://files.pythonhosted.org/packages/14/67/e42bd1181472c95c8cda79305df848264f2a7f62740995a46945d9797b67/sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2MB)
[K     |████████████████████████████████| 1.2MB 45.2MB/s 
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/fd/5b/44baae602e0a30bcc53fbdbc60bd940c15e143d252d658dfdefce736ece5/tokenizers-0.10.1-cp36-cp3

Linking with ``` models.py ``` in Drive

In [10]:
!cp -r "/content/drive/My Drive/AI_4/models.py" '/content/'

In [11]:
%cd drive/My\ Drive/AI_4
!pwd

/content/drive/My Drive/AI_4
/content/drive/My Drive/AI_4


# Initializing and tuning InferSent model
---

## __InterSent__

Conneau et al. created a bi-directional LSTM with max-pooling over its outputs to generate sentence embeddings. This model was trained from scratch on both MG-NLI and SNLI datasets which highlight the first difference compared to SBERT. Since BERT is at the core of SBERT much of its language understanding comes from the language modeling pre-training task. SBERT used the MG-NLI and SNLI datasets for fine-tuning which should allow it to have a better understanding of language.
The LSTM model was able to achieve a test score 84.5 on the SNLI dataset, outperforming at the time best performing competitor with 1.1 points.


<p align="center">
 <img src="https://miro.medium.com/max/527/1*H5POIKHhmD-L_nLylV6MSQ.png" title="Department of Informatics and Telecommunications - University of Athens"/>
 <p  align="center">Hierarchical convolutional networks</p>
</p>


## __How Infersent works__

The architecture consists of 2 parts:
1. One is the sentence encoder that takes word vectors and encodes sentences into vectors
2. Two, an NLI classifier that takes the encoded vectors in and outputs a class among entailment, contradiction and neutral.

<p align="center">
 <img src="https://miro.medium.com/max/700/1*wbuFlMRo_NTqg8w52M8THw.png" title="Department of Informatics and Telecommunications - University of Athens"/>
 <p  align="center">Infersent Flow</p>
</p>

In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initializing corpus

In [36]:
corpus = list(CORD19Dictionary.keys())[:10]

In [37]:
from models import InferSent
import torch
from sentence_transformers import SentenceTransformer, util
import torch
import time

V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 32, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}

model = InferSent(params_model).to(device)
model.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = 'Glove/glove.6B.300d.txt'
model.set_w2v_path(W2V_PATH)

Creating a cosine similarity function

In [None]:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

Building vocabulary

In [38]:
model.build_vocab(corpus, tokenize=True)

Found 155(/177) words with w2v vectors
Vocab size : 155


# Creating the embeddings
Encoding the papers

In [41]:
%%time
corpus_embeddings = torch.Tensor(model.encode(corpus,tokenize=True,verbose=True))

Nb words kept : 294/342 (86.0%)
Speed : 202.4 sentences/s (gpu mode, bsize=64)
CPU times: user 29.7 ms, sys: 17.5 ms, total: 47.3 ms
Wall time: 49.6 ms


  sentences = np.array(sentences)[idx_sort]


## Saving corpus as tensors to drive

In [21]:
print("Saving corpus as tensors to drive")
corpus_embeddings_path = r"/content/drive/My Drive/AI_4/corpus_embeddings_"+number_of_papers+"_InferSent.pt"
torch.save(corpus_embeddings,corpus_embeddings_path)

Saving corpus as tensors to drive


# Loading embeddings if have been created and saved



---

In [23]:
corpus_embeddings_path = r"/content/drive/My Drive/AI_4/corpus_embeddings_"+number_of_papers+".pt"
with open(corpus_embeddings_path, 'rb') as f:
    corpus_embeddings = torch.load(f)

# Evaluation
---


In [34]:
import re
from  nltk import tokenize
from termcolor import colored

def paperTitle(answer,SentenceMap):
  record = SentenceMap[answer]
  print("Paper title:",record[1])
  print("Paper id:   ",record[0])  

# Find the closest sentence of the corpus for each query sentence based on cosine similarity
def evaluation(topk,query_list,corpus_embeddings,resultsDf):
  query_answers = []
  scores = []
  for query in query_list:
      query_embedding = embedder.encode(query,tokenize=True)

      # We use cosine-similarity and torch.topk to find the highest scores
      cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
      top_results = torch.topk(cos_scores, k=top_k)

      print("\n\n======================\n\n")
      print("Query:",colored(query,'green') )

      for iter, score, idx in zip(range(0,top_k),top_results[0], top_results[1]):
        answer = ' '.join([re.sub(r"^\[.*\]", "", x) for x in corpus[idx].split()])
        if len(tokenize.word_tokenize(answer)) > 1:
          print("Score: {:.4f}".format(score))
          paperTitle(corpus[idx],CORD19Dictionary)
          print("Anser size: ",len(tokenize.word_tokenize(answer)))
          print("Anser: ")
          if iter == 0:
            scores.append(score.item())
            query_answers.append(answer)
          print(colored(answer,'yellow'))
        break

  resultsDf['Model_answer'] = query_answers
  resultsDf['Cosine_similarity'] = scores

In [42]:
top_k = 3
embedder = model
evaluation(top_k,query_list,corpus_embeddings,queriesDf)

  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (sentences[i], i))
  Replacing by "</s>"..' % (senten

KeyError: ignored

In [None]:
top_k = 3
embedder = model
evaluation(top_k,myquery_list,corpus_embeddings,myQueriesDf)

# References

[1] [Oficcial InferSent GitHub repository](https://github.com/facebookresearch/InferSent)

[2] [Sentence embeddings examples](https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/)