<a href="https://colab.research.google.com/github/Myrto-Iglezou/AI2-project4/blob/master/Question_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## YΣ19 Artificial Intelligence II
# Homework 4

### Iglezou Myrto - 111520170038


<img src="https://venturebeat.com/wp-content/uploads/2020/03/CORD-19.png?w=1200&strip=all" alt="Cord-19" width="600"/>

# Project Description



The objective of this project is to develop a document retrieval system that returns the titles of scientific papers containing the answer to a given user question. The dataset is the first release of the [COVID-19 Open Research Dataset (CORD-19)](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html). We implement three sentence embedding approaches (SBERT, InferSent and Doc2Vec) so that the model can retrieve the titles of the papers related to a given question.


# **Question 1** 

## Step 1 - Preprocess the provided dataset

### Read all JSON files from the folder, keeping the title and body of each paper. Then save the dataframe to a CSV file so that it can be reloaded faster.

In [None]:
import io
import os
from google.colab import drive
import pandas as pd 
import json

drive.mount('/content/drive',force_remount=True)
path = r"/content/drive/My Drive/cord-19_2020-03-13/cord-19_2020-03-13/2020-03-13/comm_use_subset"

rows = []

for filename in os.listdir(path):
  with open(os.path.join(path, filename), 'r') as f:
    json_text = json.load(f)

  # Keep only the fields needed for retrieval.
  paper_id = json_text['paper_id']         # avoid shadowing the built-in id()
  title = json_text['metadata']['title']
  body = json_text['body_text']            # list of paragraph dicts
  rows.append([paper_id, title, body])

# Building the dataframe once is much faster than appending row by row.
dataset_df = pd.DataFrame(rows, columns=['id', 'title', 'body'])
   

Mounted at /content/drive


In [None]:
dataset_df.to_csv('dataset.csv',index=False)
!cp dataset.csv "drive/My Drive/"
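
In the parsed JSON, `body_text` is a list of paragraph objects, each with a `text` field next to citation metadata, as the dataframe preview further down shows. Storing that list as-is is what makes the later character-stripping step necessary. A minimal sketch of an alternative is to join the paragraph texts before saving. The helper `join_paragraphs` and the sample record are hypothetical and only mimic the structure of one file:

```python
def join_paragraphs(body_text):
    """Concatenate the 'text' field of every paragraph into one string."""
    return " ".join(p["text"].strip() for p in body_text if p.get("text"))

# Hypothetical record with the same layout as one parsed CORD-19 JSON file.
sample = {
    "paper_id": "abc123",
    "metadata": {"title": "A toy paper"},
    "body_text": [
        {"text": "First paragraph.", "cite_spans": []},
        {"text": "Second paragraph.", "cite_spans": []},
    ],
}

body = join_paragraphs(sample["body_text"])
print(body)
```

With this variant the `body` column would hold plain text, so JSON keys and offsets would never reach the sentence tokenizer.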

### Read the dataset from the csv file and save the information in a dataframe

In [1]:
import io
from google.colab import drive
import pandas as pd 
import sys 

drive.mount('/content/drive',force_remount=True)
filePath = r"/content/drive/My Drive/dataset.csv"
dataset_df = pd.read_csv(filePath)
dataset_df.title = dataset_df.title.astype(str)  # make everything str, for lower() function

Mounted at /content/drive


**Dataframe before the preprocess**

In [2]:
dataset_df.head(5)

Unnamed: 0,id,title,body
0,236bd666a76213bc131969e1d5b66e410fc1cd45,MINI REVIEW Acute Phase Proteins in Marine Mam...,[{'text': 'The mammalian immune system include...
1,14374db205f6934d9cba148624462000bc6ec7be,Antibody Treatment against Angiopoietin-Like 4...,[{'text': 'IMPORTANCE Despite extensive global...
2,af678e8cd31d74cdb2d690addc19d59dca331f2b,Quantifying the seasonal drivers of transmissi...,"[{'text': ""Growing human population, urbanizat..."
3,42b049c2b5b32c094dc8b10f967e43ac2169b890,Evaluation of the influenza-like illness surve...,[{'text': 'the first evaluation of the Tunisia...
4,1664a9df618ca74e099245a2bd65f3172aeac284,,"[{'text': 'Worldwide, lung cancer remains the ..."


Remove some special characters, such as [, {, ' and :, as well as JSON keys such as 'text'.

In [3]:
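def removeCharacters(x):
  # regex=True is passed explicitly: recent pandas versions treat the pattern as a literal string by default.
  x = x.str.replace(r'\'text\'', '', regex=True)
  x = x.str.replace(r'\'start\'', '', regex=True)
  x = x.str.replace(r'\'end\'', '', regex=True)
  x = x.str.replace(r'[{}]', '', regex=True)
  x = x.str.replace(r'[<>]', '', regex=True)
  x = x.str.replace(r'[\[\]]', '', regex=True)
  x = x.str.replace(r'["]', '', regex=True)
  x = x.str.replace(r'[\']', '', regex=True)
  x = x.str.replace(r'[:]', '', regex=True)
  x = x.str.replace(r'[()]', '', regex=True)
  x = x.str.replace(r'[?]', '', regex=True)
  return x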

In [4]:
dataset_df['body'] = removeCharacters(dataset_df['body'])
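
To see what this cleaning does to a single record, the same targets can be applied to one string with the standard `re` module. This is only an illustrative sketch: `strip_json_debris` and the sample string are hypothetical, and the patterns are collapsed into two expressions for brevity.

```python
import re

# Same targets as removeCharacters: first the JSON keys, then the punctuation.
KEYS = re.compile(r"'(?:text|start|end)'")
PUNCT = re.compile(r"[{}<>\[\]\"':()?]")

def strip_json_debris(s):
    s = KEYS.sub("", s)
    return PUNCT.sub("", s)

raw = "[{'text': 'The mammalian immune system (innate)', 'start': 0}]"
cleaned = strip_json_debris(raw)
print(cleaned)
```

The values that belonged to the removed keys (here the offset `0`) stay in the text. That is consistent with the stray numbers and `cite_spans` fragments visible in some of the retrieved sentences later in the notebook.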

Convert the body text to lowercase

In [5]:
dataset_df['body'] = dataset_df['body'].apply(lambda x: x.lower())

**Dataframe after the preprocess**

In [6]:
dataset_df.head(5)

Unnamed: 0,id,title,body
0,236bd666a76213bc131969e1d5b66e410fc1cd45,MINI REVIEW Acute Phase Proteins in Marine Mam...,the mammalian immune system includes innate o...
1,14374db205f6934d9cba148624462000bc6ec7be,Antibody Treatment against Angiopoietin-Like 4...,"importance despite extensive global efforts, ..."
2,af678e8cd31d74cdb2d690addc19d59dca331f2b,Quantifying the seasonal drivers of transmissi...,"growing human population, urbanization and gl..."
3,42b049c2b5b32c094dc8b10f967e43ac2169b890,Evaluation of the influenza-like illness surve...,the first evaluation of the tunisian influenz...
4,1664a9df618ca74e099245a2bd65f3172aeac284,,"worldwide, lung cancer remains the most frequ..."


In [None]:
%%capture
!pip install transformers

In [None]:
%%capture
!pip install sentence-transformers

In [7]:
import torch
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
    device = 'cuda'
else:
    print('No GPU available, training on CPU.')
    device = 'cpu'

Training on GPU.


## Questions

In [8]:
questions =  [
                  "What are the coronoviruses?",
                  "What was discovered in Wuhuan in December 2019?",
                  "What is Coronovirus Disease 2019?",
                  "What is COVID-19?",
                  "What is caused by SARS-COV2?",
                  "How is COVID-19 spread?",
                  "Where was COVID-19 discovered?",
                  "How does coronavirus spread?",
              ]

## Create the list of sentences



In [9]:
import nltk
from nltk import tokenize
nltk.download('punkt')

ListOfBodies = dataset_df['body'].apply(lambda x: tokenize.sent_tokenize(x))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Useful functions for the models

In [10]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

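def most_similar(sentences, embeddings, query_embedding, k = 1):
    # Stack the sentence vectors into an (n_sentences, dim) matrix.
    X = np.stack(embeddings)
    # cosine_similarity returns shape (n_sentences, 1); flatten it to one score per sentence.
    scores = cosine_similarity(X, query_embedding.reshape(1, -1)).ravel()
    # Return the k (sentence, score) pairs with the highest similarity.
    return sorted(zip(sentences, scores), key=lambda v: v[1], reverse=True)[:k]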
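
The ranking step can be checked on toy data without scikit-learn. The sketch below computes cosine similarity directly with NumPy; `top_k_cosine` and the two-dimensional "embeddings" are made up for illustration and are not part of the notebook.

```python
import numpy as np

def top_k_cosine(sentences, embeddings, query, k=1):
    X = np.asarray(embeddings, dtype=float)   # (n_sentences, dim)
    q = np.asarray(query, dtype=float)        # (dim,)
    # Cosine similarity = dot product divided by the product of the norms.
    # Assumes no vector is all zeros (that would divide by zero).
    scores = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)[:k]           # indices of the k highest scores
    return [(sentences[i], float(scores[i])) for i in order]

sentences = ["a", "b", "c"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k_cosine(sentences, embeddings, [2.0, 0.1], k=2)
print(ranked)
```

The query points almost along the first axis, so sentence "a" ranks first and "c", which shares that direction only partly, ranks second.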

In [11]:
def ask_question(model,question,ListOfSentences,sentence_embeddings,dataset_df):
  # Embed the question and find the closest sentence.
  query_vec = model.encode([question])[0]
  similar = most_similar(ListOfSentences,sentence_embeddings,query_vec)
  text = similar[0][0]
  # regex=False: match the sentence literally, since it may contain regex metacharacters.
  row = dataset_df[dataset_df['body'].str.contains(text, regex=False)]
  title = row['title'].tolist()[0]

  return title,text
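
Looking the title up by searching every body for the retrieved sentence works, but it scans the whole dataframe for each question and returns the first match if a sentence appears in several papers. One possible alternative, sketched below under the assumption that the sentence list is built article by article as in this notebook, is to record the owning article index of every sentence. `build_sentence_index` and the toy data are hypothetical.

```python
def build_sentence_index(list_of_bodies, max_articles=None):
    """Flatten per-article sentence lists, remembering each sentence's article."""
    sentences, owners = [], []
    for article_idx, body_sentences in enumerate(list_of_bodies):
        if max_articles is not None and article_idx >= max_articles:
            break
        sentences.extend(body_sentences)
        owners.extend([article_idx] * len(body_sentences))
    return sentences, owners

# Toy stand-ins for ListOfBodies and dataset_df['title'].
bodies = [["s0a", "s0b"], ["s1a"], ["s2a", "s2b"]]
titles = ["Paper 0", "Paper 1", "Paper 2"]

sentences, owners = build_sentence_index(bodies, max_articles=2)
best = sentences.index("s1a")      # stand-in for the position of the top-ranked sentence
title = titles[owners[best]]
print(title)
```

For this to work, the ranking function would have to return the position of the best sentence rather than its text.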

In [12]:
from termcolor import colored

def print_answer(question,title,text):

  print(colored("Question : ",attrs=['bold']),question)
  print("\n")
  print(colored("Title : ",attrs=['bold']),title)
  print("\n")
  print(colored("Text : ",attrs=['bold']),text)
  print("\n")
  # text = row['body'].apply(lambda x: tokenize.sent_tokenize(x))
  # num = 0
  # for t in text:
  #   for sentence in t:
  #     if sentence == l1:
  #         print(t[num-1], end=" ")
  #         print("\n")
  #         print(colored(sentence,'grey','on_yellow'), end=" ")
  #         print("\n")
  #         print(t[num+1], end=" ")
  #         num+=1
  # print("\n")

## Step 2.a - First sentence embedding approach

### SBERT Model

This approach uses Sentence-BERT (SBERT), a modification of the BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings. This makes semantic search more efficient, which is what the following application relies on.

The siamese architecture produces a fixed-size vector for each input sentence. Semantically similar sentences can then be found with a similarity measure such as cosine similarity or Manhattan / Euclidean distance. Cosine similarity is used in this work.

The SBERT model used here is fine-tuned on NLI data, which yields state-of-the-art sentence embeddings, as reported in the [SBERT paper](https://arxiv.org/pdf/1908.10084.pdf).
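
The fixed-size vector comes from a pooling step over the token embeddings. The model name used below, `bert-base-nli-mean-tokens`, refers to mean pooling. The following is a rough NumPy sketch of that idea, not the library's implementation; `mean_pool` and the toy token vectors are assumptions made for illustration.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padded positions."""
    tokens = np.asarray(token_embeddings, dtype=float)        # (seq_len, dim)
    mask = np.asarray(attention_mask, dtype=float)[:, None]   # (seq_len, 1)
    # Guard against an all-padding input to avoid dividing by zero.
    return (tokens * mask).sum(axis=0) / max(mask.sum(), 1.0)

tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]   # last row is padding
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)
```

Whatever the sentence length, the result has the dimensionality of a single token vector, which is what makes the cosine comparison between a question and a sentence possible.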

SBERT Framework example

<img src="https://combine.se/wp-content/uploads/2019/09/3.png" alt="Cord-19" width="300"/>

In [None]:
ListOfSentences = []
numOfArticles = 0
for text in ListOfBodies:
  ListOfSentences += text
  numOfArticles+=1
  if(numOfArticles == 4000):
    break

In [None]:
import torch
from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer('bert-base-nli-mean-tokens',device=device)

100%|██████████| 405M/405M [00:14<00:00, 27.7MB/s]


In [None]:
import time

start_time = time.time()

sentence_embeddings = sbert_model.encode(ListOfSentences,device=device)

for question in questions: 
  title,text = ask_question(sbert_model,question,ListOfSentences,sentence_embeddings,dataset_df)
  print("-------------------------------------------------------------------------\n")
  print_answer(question, title, text)

print("-------------------------------------------------------------------------\n")
print("Time: %s seconds" % (time.time() - start_time))

-------------------------------------------------------------------------

Question :  What are the coronoviruses?


Title :  A viral metagenomic survey identifies known and novel mammalian viruses in bats from Saudi Arabia


Text :  c parechoviruses.


-------------------------------------------------------------------------

Question :  What was discovered in Wuhuan in December 2019?


Title :  Old World camels in a modern world -a balancing act between conservation and genetic improvement


Text :  the photoperiod nagy & juhasz 2019 .


-------------------------------------------------------------------------

Question :  What is Coronovirus Disease 2019?


Title :  Estimated effectiveness of symptom and risk screening to prevent the spread of COVID-19


Text :  hcov-19 has been proposed as an alternate name for the virus; jiang et al., 2020 .


-------------------------------------------------------------------

## Step 2.b - Second sentence embedding approach

### InferSent Model 

InferSent is a sentence embedding method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks. As with Sentence-BERT, training takes a pair of sentences and encodes each one to produce its sentence embedding. The relation between the two embeddings is then extracted using:

* concatenation
* element-wise product
* absolute element-wise difference.
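
The three operations listed above can be sketched in a few lines of NumPy. `pair_features` is a hypothetical helper and the vectors are toy values; in InferSent the resulting feature vector is fed to a classifier during NLI training, while retrieval in this notebook uses only the encoder output for each sentence.

```python
import numpy as np

def pair_features(u, v):
    """Combine two sentence embeddings into one feature vector."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Concatenation of u and v, absolute element-wise difference, element-wise product.
    return np.concatenate([u, v, np.abs(u - v), u * v])

u = [1.0, 2.0]
v = [3.0, 1.0]
features = pair_features(u, v)
print(features)
```

For embeddings of dimension d, the combined vector has dimension 4d.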

<img src="https://miro.medium.com/max/972/1*efWq1UrOcGy2E-34OxsBHQ.png" alt="InferSent" width="300"/>

In [None]:
! mkdir -p encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
  
! mkdir -p GloVe
! curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
! unzip -o GloVe/glove.840B.300d.zip -d GloVe/

In [None]:
drive.mount('/content/drive',force_remount=True)
sys.path.append('/content/drive/My Drive/')
!cp -r "/content/drive/My Drive/models.py" '/content/'

In [None]:
ListOfSentences = []
numOfArticles = 0
for text in ListOfBodies:
  ListOfSentences += text
  numOfArticles+=1
  if(numOfArticles == 100):
    break

In [16]:
import models
from models import InferSent
import torch 

V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = '/content/GloVe/glove.840B.300d.txt'
model.set_w2v_path(W2V_PATH)

In [17]:
if(train_on_gpu):
  model = model.cuda()

In [None]:
import time

start_time = time.time()
model.build_vocab(ListOfSentences, tokenize=True)

# encode() expects a list of sentences; a bare string would be iterated character by character.
InferSent_embeddings = model.encode(ListOfSentences, bsize=64, tokenize=True)

for question in questions: 
  title,text = ask_question(model,question,ListOfSentences,InferSent_embeddings,dataset_df)
  print("-------------------------------------------------------------------------\n")
  print_answer(question, title, text)
  
print("-------------------------------------------------------------------------\n")
print("Time: %s seconds " % (time.time() - start_time))

## Step 2.c - Third sentence embedding approach

### Doc2Vec

Doc2vec is an unsupervised algorithm that generates vectors for sentences, paragraphs, or documents. It is an adaptation of word2vec, which generates vectors for words. The resulting vectors can be used for tasks such as measuring the similarity between sentences, paragraphs, or documents.

<img src="https://miro.medium.com/max/972/0*x-gtU4UlO8FAsRvL." alt="InferSent" width="400"/>

In the figure above, every paragraph is mapped to a unique vector, represented by a column in matrix D, and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context.
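
The combination step described for the figure can be illustrated with random matrices. This is only a sketch of how a context vector is formed from D and W, not of how gensim trains the model; the sizes, `context_vector`, and the random values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size, n_paragraphs = 4, 5, 3
D = rng.normal(size=(dim, n_paragraphs))   # one column per paragraph
W = rng.normal(size=(dim, vocab_size))     # one column per word

def context_vector(paragraph_id, word_ids, mode="mean"):
    """Combine a paragraph vector with the vectors of its context words."""
    cols = [D[:, paragraph_id]] + [W[:, w] for w in word_ids]
    if mode == "mean":
        return np.mean(cols, axis=0)       # averaged: same size as one vector
    return np.concatenate(cols)            # concatenated: one block per input

avg = context_vector(0, [1, 2])
cat = context_vector(0, [1, 2], mode="concat")
print(avg.shape, cat.shape)
```

Averaging keeps the dimensionality fixed regardless of the window size, whereas concatenation grows with the number of context words.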

In [None]:
from nltk.tokenize import word_tokenize

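def Doc2Vec_ask_question(model, question, ListOfSentences,dataset_df):
  # Infer a vector for the tokenized question and find the closest training sentence.
  test_doc = word_tokenize(question.lower())
  test_doc_vector = model.infer_vector(test_doc)
  similar = model.docvecs.most_similar(positive = [test_doc_vector])
  text = ListOfSentences[similar[0][0]]
  # regex=False: match the sentence literally rather than as a regular expression.
  row = dataset_df[dataset_df['body'].str.contains(text, regex=False)]

  # Fall back to the paper id when the title is missing
  # (assumption: missing titles were read from the CSV and cast to the string 'nan').
  title = row['title'].tolist()[0]
  if not title or title == 'nan':
    title = row['id'].tolist()[0]

  return title,text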

In [None]:
ListOfSentences = []
numOfArticles = 0
for text in ListOfBodies:
  ListOfSentences += text
  numOfArticles+=1
  if(numOfArticles == 8000):
    break

In [None]:
import time
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

start_time = time.time()

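# TaggedDocument expects a list of tokens; a raw string would be treated as a sequence of characters.
tagged_data = [TaggedDocument(word_tokenize(d), [i]) for i, d in enumerate(ListOfSentences)]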

model = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1)

for question in questions: 
  title,text = Doc2Vec_ask_question(model,question,ListOfSentences,dataset_df)
  print("-------------------------------------------------------------------------\n")
  print_answer(question,title,text)


print("-------------------------------------------------------------------------\n")
print("Time: %s seconds " % (time.time() - start_time))

-------------------------------------------------------------------------

Question :  What are the coronoviruses?


Title :  Noroviruses subvert the core stress granule component G3BP1 to promote viral VPg-dependent translation


Text :  bv2 cells were maintained in dmem supplemented with 10% fcs biosera, 2 mm l-glutamine, 0.075% sodium bicarbonate gibco and the antibiotics penicillin and streptomycin.


-------------------------------------------------------------------------

Question :  What was discovered in Wuhuan in December 2019?


Title :  Acute systemic DNA damage in youth does not impair immune defense with aging


Text :  overall, our results highlight remarkable resilience of the immune system to withstand extensive dna damage and continue to competently operate for life., cite_spans  556,  881,  hayashi, t. heather e. lynch, susan geyer, kengo yoshida, keiko furudoi, keiko sasaki, yukari morishita, hiroko nagamura, mayumi ma

## Step 3 - Comparison of the three models

## References



*   https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/
*   https://github.com/facebookresearch/InferSent
*   https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5
*   https://kanoki.org/2019/03/07/sentence-similarity-in-python-using-doc2vec/

