### **TFIDF-based Automated Literature Review**

***

This notebook applies Natural Language Processing (NLP) and other AI techniques to generate insights in support of the ongoing fight against COVID-19. These approaches are increasingly urgent because the rapid growth of new coronavirus literature makes it difficult for the medical research community to keep up.
<br>
<br>Language is unstructured data produced by people to be understood by other people. Text data is not random: it is governed by linguistic properties that make it understandable to people and also processable by computers.

***

**Methodology:** The authors extracted a number of abstracts that openly studied COVID-19 and its derivatives. Each abstract was first embedded in a static TFIDF model. Then, the cosine similarity of a **dynamic user query** was computed against each abstract of the static corpus. By sorting the articles by their similarity to a given query, we can rapidly and easily access the most relevant ones. **This allows the medical research community, governments, and decision-makers to rapidly consult the latest findings and discoveries in a given knowledge area** through automated literature review. Links to the full-text research papers are also embedded and directly clickable in the output dataframe.

Furthermore, the authors proposed additional text-mining approaches to generate insights from the corpus of abstracts, including 1) wordcloud of COVID-19 abstracts, 2) Word2Vec model to retrieve the most similar words to a specific word (e.g., retrieve the most similar words to *"origin"*, *"transmission"*), and 3) t-SNE visualization of semantic clusters from the corpus.
***
**Highlights:**
1.  **TFIDF-based Automated Literature Review**
2.  Wordcloud of COVID-19 Abstracts
3.  Word2Vec Model and Textual Similarities
4.  TSNE-Visualization of Semantic Clusters
***

**Pros:**
* Automation of Literature Review / Efficient Abstract Browser
* Easy and Rapid Access to the Latest Findings in a Given Domain
* Insightful Data Visualization Tools

**Cons:**
* An abstract is only a partial summary of a research paper.
* Reduced Scope: Analysis of 5,058 abstracts out of 47,000 research papers.

***

# **Part 1: Data Extraction and Preparation**

**Scope:** To extract the data, the code below scraped the COVID-19 Open Research Dataset **(CORD-19)**, a resource of over 47,000 scholarly articles in JSON format about COVID-19, SARS-CoV-2, and related coronaviruses. Among these, 26,408 articles (56% of the corpus) include abstracts, which are assumed to summarize the content of their respective articles. This notebook focuses on the rapidly emerging literature that directly mentions the terms *{"Coronavirus", "Covid", "2019-nCov", "COVID-19", "SARS-CoV-2"}*, i.e., 5,058 abstracts that represent 19.1% of the total number of abstracts. Only the abstracts were considered here, but the approach could also be applied to the full text bodies.
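
The keyword filter described above can be sketched on a few toy abstracts. This is only an illustration: the abstracts and the helper name `mentions_covid` are hypothetical, and the keyword list is copied from the extraction cell that follows.

```python
# Hypothetical toy abstracts; the notebook reads the real ones from the CORD-19 JSON files.
toy_abstracts = [
    "We describe the clinical features of COVID-19 in 41 patients.",
    "A survey of influenza vaccination rates in older adults.",
    "Structure of the SARS-CoV-2 spike glycoprotein.",
]

# Same keyword list as the extraction cell (matched as case-sensitive substrings).
covid_terms = ["COVID", "covid", "Covid", "coronavirus",
               "Coronavirus", "2019-nCov", "SARS-CoV-2"]


def mentions_covid(abstract, terms=covid_terms):
    """Return True if any keyword occurs as a substring of the abstract."""
    return any(term in abstract for term in terms)


relevant = [a for a in toy_abstracts if mentions_covid(a)]
print(len(relevant), "of", len(toy_abstracts), "toy abstracts kept")
```

Because the match is case-sensitive, a spelling such as "2019-nCoV" does not match the listed term "2019-nCov"; lowercasing both sides before comparing would be one way to make the filter more permissive.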

In [None]:
import io
import os
import re
import fnmatch
import json
import operator
from time import time

import requests
import pandas as pd
import matplotlib.pyplot as plt
import en_core_web_sm
import nltk
from nltk.tokenize import RegexpTokenizer, sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS
import gensim
import gensim.corpora as corpora
from gensim.models import TfidfModel
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from IPython.display import HTML, display
from tqdm import tqdm

pd.set_option('display.max_colwidth', 0)
stopwords = nltk.corpus.stopwords.words('english')
nlp = en_core_web_sm.load()

In [None]:
#### Time-Consuming
t = time()

#### Create Empty DataFrame ####

ABS = pd.DataFrame(columns = ["Title", "Abstract", "URL", "Published Year"])

#### Retrieve articles URL ####

df = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
df=df[["title", "cord_uid", "sha", "url", "publish_time"]]

#### Scraping to retrieve relevant articles ####

for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            if fnmatch.fnmatch(filename, '*.json'):
                path = os.path.join(dirname, filename)
                # Use a context manager so each file handle is closed after reading
                with open(path) as f:
                    data = json.load(f)
                paper_id = data["paper_id"]                
            
                try: 
                    abst = pd.DataFrame(data["abstract"])
                    abst =abst[["text"]]
                    abst=abst.rename(columns={"text": "Abstract"})

                    art = pd.DataFrame(data["metadata"])
                    art =art[["title"]]
                    art=art.rename(columns={"title": "Title"})

                    art["Abstract"] = abst.iloc[0,0]
                    subset = df[df["sha"]==paper_id]
                    art["URL"] = subset.iloc[0,3]
                    art["Published Year"] = subset.iloc[0,4]
                    #### Track articles that explicitly mention coronaviruses and close derivatives in their abstracts ####

                    Covid_List = ["COVID", "covid", "Covid", "coronavirus", \
                                  "Coronavirus", "2019-nCov",  "SARS-CoV-2" ]                  
                    
                    if any(s in art.iloc[0, 1] for s in Covid_List):

                        ABS = pd.concat([ABS, art], sort = False)
                        
                except Exception:
                    # Skip papers with a missing abstract, title, or metadata row
                    pass

            
ABS = ABS [["URL", "Title", "Published Year", "Abstract"]]
ABS["Published Year"] = pd.DatetimeIndex(ABS["Published Year"]).year 
ABS = ABS.drop_duplicates()
ABS.to_csv("Relevant_COVID-19_Abstracts.csv", index = False)

print("Number of Relevant Abstracts", len(ABS))

#### Embedding Hyperlinks in the DataFrame ####

for i in range(len(ABS)):
    if ABS.iloc[i, 1] != "":
        ABS.iloc[i, 1] = '<a href="{}">{}</a>'.format(ABS.iloc[i, 0], ABS.iloc[i, 1])
    else:
        ABS.iloc[i, 1] = '<a href="{}">{}</a>'.format(ABS.iloc[i, 0], "Link to Article")

ABS = ABS[["Title", "Published Year", "Abstract"]]

display(HTML(ABS.head(2).to_html(escape=False)))

print('Time to process data: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
plt.subplots(figsize = (10,6))
plt.hist(ABS["Published Year"],bins = 30, edgecolor ="black")
plt.title("Coronavirus-Related Academic Publications \n", fontsize = 20, fontweight = "bold")
plt.xlabel("Published Year")
plt.ylabel("Publications")
plt.savefig("COVID-19_Publications_Histogram.png")
plt.show()

# **Part 2: TFIDF-based Abstract Browser**

**TFIDF**, short for **Term Frequency–Inverse Document Frequency**, is a numerical statistic that reflects how important a word is to a document in a corpus.

\begin{equation*}
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)
\end{equation*}
with:
\begin{equation*}
\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}
\end{equation*}
and:
\begin{equation*}
\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}
\end{equation*}

where $f_{t,d}$ is the number of times that term t occurs in document d, tf(t,d) is the resulting augmented term frequency (scaled by the count of the most frequent term in d), idf(t, D) is the logarithmically scaled inverse fraction of the documents that contain the term t, and N is the total number of documents in the corpus D.
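
As a worked illustration, the formulas above can be evaluated by hand on a tiny corpus. This is a minimal sketch in plain Python: the toy documents and the function names `tf`, `idf`, and `tfidf` are assumptions made for the example. Gensim's `TfidfModel`, used in the next cell, applies its own default weighting and normalization, so its scores are not expected to match these numbers.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["virus", "transmission", "virus"],
    ["vaccine", "trial"],
    ["virus", "vaccine"],
]


def tf(term, doc):
    """Augmented term frequency: 0.5 + 0.5 * f / (count of the most frequent term)."""
    counts = Counter(doc)
    return 0.5 + 0.5 * counts[term] / max(counts.values())


def idf(term, corpus):
    """Natural log of N over the number of documents containing the term.

    Assumes the term occurs in at least one document of the corpus.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


for term in ["virus", "transmission"]:
    print(term, round(tfidf(term, docs[0], docs), 3))
```

In the first document, "virus" is the most frequent term (tf = 1.0) but appears in two of the three documents, while "transmission" has a lower tf (0.75) but appears in only one document, so its idf and its final score are higher.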

In [None]:
def automated_LR(user_query):
    '''
    Function Description: Tokenize and clean each of the retrieved abstracts
    Create a dictionary converting distinct words into distinct numerical IDs
    Build a tfidf model on static corpus
    Compute cosine similarity of a dynamic query against a static corpus of documents
    
    Display most similar abstracts to the dynamic query - cosine similarity scores also displayed
    
    ''' 
    #### Data Preparation of Static Corpus #### 

    data1 = ABS['Abstract'].values.tolist() # List of documents

    file_docs = [[w.lower() for w in word_tokenize(text)] for text in data1]

    dictionary = gensim.corpora.Dictionary(file_docs)

    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in file_docs]

    #### TF-IDF Model ####

    tf_idf = gensim.models.TfidfModel(corpus)

    #### Compute cosine similarity of a dynamic query against a static corpus of documents ####
    
    sims = gensim.similarities.Similarity('/kaggle/working',tf_idf[corpus], num_features=len(dictionary))

    #### Process Dynamic USER QUERY ####

    # Tokenize the whole query at once so that every sentence contributes
    # to the bag-of-words representation
    query_doc = [w.lower() for w in word_tokenize(user_query)]
    query_doc_bow = dictionary.doc2bow(query_doc)

    #### Retrieve & Display Most Similar Abstracts ####

    query_doc_tf_idf = tf_idf[query_doc_bow]
    ABS1 = ABS.copy()  # work on a copy so the global ABS is not modified
    ABS1["Similarity"] = sims[query_doc_tf_idf]
    ABS1 =ABS1.sort_values(by = "Similarity", ascending = False)
    
    display(HTML(ABS1.head(3).to_html(escape=False)))

# **Part 3: Automated Literature Review**

This approach enables the medical research community to rapidly review the latest findings and discoveries in a given domain. Given a dynamic query as input, the *automated_LR* function returns the abstracts most relevant to that query. Three examples are shown below to automate the literature review for *"What is known about transmission, incubation, and environmental stability?"* (i.e., Task 1). Other tasks can be addressed similarly by adjusting the dynamic user query.
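
The ranking step behind such a query can be sketched without Gensim. The example below is an assumption-laden simplification: it uses raw term counts instead of TFIDF weights, two hypothetical one-line abstracts, and a hand-written `cosine` helper, purely to show how a query is scored against each document and the results sorted.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy abstracts, represented by raw term counts for brevity.
corpus = {
    "paper_a": "incubation period of the virus in humans",
    "paper_b": "protective equipment for health care workers",
}
vectors = {name: Counter(text.split()) for name, text in corpus.items()}

# Score the query against every document and sort from most to least similar.
query = Counter("incubation period in humans".split())
ranking = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
print(ranking)
```

A document that shares no term with the query gets a similarity of zero, which is why queries phrased with vocabulary close to that of the abstracts tend to retrieve more useful results.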

### **<font color=green>What is known about transmission, incubation, and environmental stability?</font>**

In [None]:
#### What is known about transmission, incubation, and environmental stability? ####

#### Dynamic User Query 1 ####
'''Provide a quick overview of findings and results'''

user_query = "Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery."

automated_LR(user_query)

In [None]:
#### Dynamic User Query 2 ####

user_query = "Persistence and stability on a multitude of substrates and sources \
                    (e.g., nasal discharge, sputum, urine, fecal matter, blood)."

automated_LR(user_query)

In [None]:
#### Dynamic User Query 3 ####

user_query = "Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings"

automated_LR(user_query)

Similarly, the literature review of other tasks (e.g., *"What do we know about COVID-19 risk factors?"*, *"What do we know about vaccines and therapeutics?"*, *"What has been published about medical care?"*, or *"What do we know about non-pharmaceutical interventions?"*) can be automated by setting the user query to any task subsection, as follows.

### **<font color=green>What do we know about COVID-19 risk factors?</font>**

In [None]:
#### What do we know about COVID-19 risk factors?  ####

user_query = "Data on potential risk factors: smoking, pre-existing pulmonary disease, co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other comorbidities, neonates and pregnant women."

automated_LR(user_query)

### **<font color=green>What do we know about vaccines and therapeutics?</font>**

In [None]:
#### What do we know about vaccines and therapeutics?  ####

user_query = "Effectiveness of drugs being developed and tried to treat COVID-19 patients. Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication."

automated_LR(user_query)

### **<font color=green>What has been published about medical care?</font>**

In [None]:
#### What has been published about medical care?  ####

user_query = "Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes."

automated_LR(user_query)

### **<font color=green>What do we know about non-pharmaceutical interventions?</font>**

In [None]:
#### What do we know about non-pharmaceutical interventions?  ####

user_query = "Methods to control the spread in communities, barriers to compliance and how these vary among different populations."

automated_LR(user_query)

# **Part 4: Wordcloud of COVID-19 Abstracts**

In a wordcloud, the importance of each word is shown by its font size. In this section, a wordcloud of the most frequent words in the corpus of COVID-19 abstracts is built. A number of preprocessing steps (e.g., tokenization, lemmatization) are required first. As expected, words such as *"coronavirus"*, *"infection"*, and *"respiratory"* are particularly prominent. Other words such as *"vaccine"* and *"protein"* are also extensively discussed in the literature.
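
The frequency counts that drive the font sizes can be sketched in a few lines. This is a simplified illustration: the toy abstracts and the three-word stop list are hypothetical, a regular expression stands in for NLTK's tokenizer, and lemmatization is left out.

```python
import re
from collections import Counter

# Hypothetical toy abstracts and a tiny stop-word list
# (the notebook uses NLTK's English stop words instead).
toy_abstracts = [
    "The coronavirus infection spreads through respiratory droplets.",
    "Respiratory infection and the coronavirus vaccine.",
]
toy_stopwords = {"the", "and", "through"}

# Lowercase, keep alphabetic tokens only, then drop stop words and count.
tokens = re.findall(r"[a-z]+", " ".join(toy_abstracts).lower())
frequencies = Counter(t for t in tokens if t not in toy_stopwords)

print(frequencies.most_common(3))
```

A word-to-count mapping of this kind is what the `wordcloud` function in the next cell passes to `WordCloud.generate_from_frequencies`.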

In [None]:
def wordcloud(df, name):

    words_dict = {}
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords.extend(["also", "three", "may"])

    text = " ".join(review for review in df[name])
    tokenizer = RegexpTokenizer(r'\w+')
    TK = tokenizer.tokenize(text.lower())

    filtered_words = list(filter(lambda word: word not in stopwords, TK))

    filtered_words1 = [w for w in filtered_words if w.isalpha()]
        
    lemmatizer = WordNetLemmatizer()

    for w in range(len(filtered_words1)):
        filtered_words1[w] = lemmatizer.lemmatize(filtered_words1[w])

    for word in filtered_words1:
        words_dict[word] = words_dict.get(word, 0) + 1

    sorted_d = sorted(words_dict.items(), key=operator.itemgetter(1), reverse=True)
    
    return WordCloud(max_words=200, max_font_size=50, relative_scaling=0.5, stopwords=stopwords,
                background_color="white").generate_from_frequencies(words_dict) 
    

In [None]:
plt.subplots(figsize = (10,8))
 
plt.imshow(wordcloud(ABS, "Abstract"), interpolation='bilinear')
plt.axis("off")
plt.title("Wordcloud of Abstracts \n", fontsize = 24, fontweight = "bold")
plt.savefig("Abstracts_Wordcloud.png")
plt.show()

# **Part 5: Word2Vec Model and Text Similarities**

A **Word2Vec** model (Word to Vector) was built using the Gensim Python library to produce word embeddings. Using a large corpus of text as input, a Word2Vec model returns a vector (here, 100 dimensions) for each word that occurs at least `min_count` times in the corpus (here, 10). The similarity between vectors is measured with the cosine similarity metric. Similar vectors represent words that are semantically related in the original corpus.
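
To make the "most similar words" idea concrete, here is a minimal sketch on made-up 3-dimensional vectors. The vectors, the `cosine` helper, and this `most_similar` function are hypothetical stand-ins written for illustration; they are not Gensim's implementation, and the values do not come from the trained model.

```python
import math

# Hypothetical 3-dimensional embeddings; the notebook's model uses 100 dimensions.
toy_vectors = {
    "transmission": [0.9, 0.1, 0.0],
    "contact":      [0.8, 0.2, 0.1],
    "vaccine":      [0.0, 0.9, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar("transmission", toy_vectors))
```

Vectors pointing in nearly the same direction score close to 1, so in this toy setup "contact" ranks above "vaccine" for the word "transmission".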

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out


def text_data(df, name):
    
    corpus1 = []

    tokenizer = RegexpTokenizer(r'\w+')
    stopwords = nltk.corpus.stopwords.words('english')
           
    for sentence in df[name]:
            word_list = tokenizer.tokenize(sentence.lower())
            word_list1 = [word for word in word_list if word.isalpha()]
            word_list2 = [word for word in word_list1 if word not in stopwords]
            corpus1.append(word_list2)

    bigram = gensim.models.Phrases(corpus1, min_count=10, threshold=100)  # higher threshold fewer phrases.
    trigram = gensim.models.Phrases(bigram[corpus1], threshold=100)

        # Faster way to get a sentence clubbed as a trigram/bigram
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)

    def make_bigrams(texts):
        return [bigram_mod[doc] for doc in texts]

    def make_trigrams(texts):
        return [trigram_mod[bigram_mod[doc]] for doc in texts]

    corpus2 = make_bigrams(corpus1)

    corpus2= make_trigrams(corpus2)

    corpus2 = lemmatization(corpus2, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    
    return corpus2

covid_corpus = text_data(ABS, "Abstract")

In [None]:
from gensim.models import word2vec, KeyedVectors
filename = 'Word2Vec_Covid_100dimensions'

#### Build Word2Vec model ####

model = word2vec.Word2Vec(covid_corpus, size=100, window=8, min_count=10, workers=10)
model.train(covid_corpus, total_examples=len(covid_corpus), epochs=15)
model.wv.save(filename)
word_vectors = KeyedVectors.load(filename)

#### Example of Word Embedding ####

print("Word Embedding for 'Coronavirus'", model.wv['coronavirus'])

In [None]:
#### Most Similar Words to ... ####


word = "diagnostic"
print("Most similar words to {}:".format(word))
print(word_vectors.most_similar(positive=word, topn=10))
print("\n")
word = "surveillance"
print("Most similar words to {}:".format(word))
print(word_vectors.most_similar(positive=word, topn=10))
print("\n")
word = "origin"
print("Most similar words to {}:".format(word))
print(word_vectors.most_similar(positive=word, topn=20))
print("\n")
word = "transmission"
print("Most similar words to {}:".format(word))
print(word_vectors.most_similar(positive=word, topn=20))

For instance, a glance at the results shows that the word ***"origin"*** is mainly linked to terms including the bigram ***"interspecies_transmission"*** and words such as ***"pangolin"***, ***"evolution"***, and ***"zoonotic"*** (a zoonosis is an infectious disease caused by a pathogen that has jumped from non-human animals to humans). <br><br>

Likewise, regarding what is known about transmission, the words most similar to ***"transmission"*** include insightful terms such as ***"travel"***, ***"contact"***, and ***"fomite"*** (a fomite is any inanimate object that, when contaminated with or exposed to infectious agents, can transfer disease to a new host).

# **Part 6: T-SNE Visualization of Semantic Clusters**

The **t-distributed Stochastic Neighbor Embedding (t-SNE)** dimensionality reduction technique was then applied to project each word embedding onto a 2D plane, where each point is plotted with its word label. A **K-means** clustering algorithm was also run with the *Scikit-learn* Python library to partition the words into semantic clusters. To determine the optimal number of clusters K, the **elbow method** was used: the code below plots the sum of squared distances for K from 1 to 29. If the plot looks like an arm, then the elbow on the arm is taken as the optimal K. Here, **K = 7**.
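
The elbow is read off the plot by eye in this notebook. As a hedged alternative, the sketch below locates it numerically with one simple heuristic, the largest second difference of the curve. The function name `elbow_index` and the inertia values are hypothetical, and other heuristics may pick a different K.

```python
def elbow_index(inertias):
    """Return the position of the largest second difference in the curve.

    This is one simple heuristic for locating the elbow; it needs at
    least three values and is not the only reasonable choice.
    """
    second_diffs = [
        inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        for i in range(1, len(inertias) - 1)
    ]
    return 1 + second_diffs.index(max(second_diffs))


# Hypothetical sums of squared distances for K = 1..6 (not the notebook's values).
ks = [1, 2, 3, 4, 5, 6]
toy_inertias = [1000.0, 600.0, 300.0, 250.0, 220.0, 200.0]

best_k = ks[elbow_index(toy_inertias)]
print("Elbow at K =", best_k)
```

With real data the curve is often smoother than in this toy case, so a numeric pick is best treated as a cross-check on the visual reading rather than a replacement for it.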

In [None]:
def Word2Vec_Sorted(model):
    ''' 
    Function to extract the word2vec embeddings 
    of the most frequent terms in the corpus
    '''
    stopwords.extend(["also", "however", "could"])
    w2c = dict()  
    
    for item in model.wv.vocab:
        w2c[item]=model.wv.vocab[item].count
    w2cSorted=dict(sorted(w2c.items(), key=lambda x: x[1],reverse=True))
    w2cSortedList = list(w2cSorted.keys())
    w2cSortedList = [word for word in w2cSortedList if word not in stopwords]
    
    return w2cSortedList


#### Implementation of the elbow method to find the optimal number of clusters K ####

Sum_of_squared_distances = []
tokens = []

for word in Word2Vec_Sorted(model):
    tokens.append(model.wv[word])  # look up the vector through the KeyedVectors interface

tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=500)
new_values1 = tsne_model.fit_transform(tokens)

K = range(1,30)

for k in tqdm(K):
    km = KMeans(n_clusters=k)
    km = km.fit(new_values1)
    Sum_of_squared_distances.append(km.inertia_)
    
#### Plot the "elbow" curve ####
plt.subplots(figsize = (10,6))
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal Number of Clusters K')
plt.savefig("Elbow_Method_Optimal_K.png")
plt.show()

In [None]:
def tsne_plot(model, key_words):

    "Creates a TSNE model and plots it"
    labels = []
    tokens = []
    
    #### For visualisation and clarity purposes, we only project the 300 most frequent words ####
    
    for word in Word2Vec_Sorted(model)[:300]:
        tokens.append(model.wv[word])  # look up the vector through the KeyedVectors interface
        labels.append(word)

    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=500)
    new_values = tsne_model.fit_transform(tokens)
    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])

    clusters = KMeans(n_clusters=7)
    clusters.fit(new_values)
    y_kmeans = clusters.predict(new_values)
    
    colmap = {0: 'red', 1: 'green', 2: 'blue', 3 :'black', 4:'fuchsia', 5:'orange', 6:'grey', 7:'grey'}

    dict1={}
    for i in range(len(colmap)):
        dict1[colmap[i]]=list(y_kmeans).count(i)/len(y_kmeans)*100
    
    plt.figure(figsize=(20,15))

    plt.title('TSNE Model Visualization \n', fontsize = 20, fontweight = "bold")
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')

    for i in range(len(new_values)):
        plt.scatter(x[i], y[i], color=colmap[y_kmeans[i]], s=12)
        if labels[i] not in key_words:
            plt.annotate(labels[i],
                         xy=(x[i], y[i]),
                         xytext=(5, 2),
                         textcoords='offset points',
                         ha='right',
                         va='bottom', fontsize=12)
        else:
            
            plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom', fontsize = 12, weight="bold", color= 'red')
    
 
    plt.savefig("tsne_covid.png")
    plt.show()
    plt.close()

In [None]:
t = time()

#### TSNE Data Visualization ####

key_words = ['origin', 'transmission', 'symptom', 'surveillance', 'diagnostic', 'patient', 'cause', 'outbreak', "vaccine"]

tsne_plot(model, key_words)

print('Time to process data: {} mins'.format(round((time() - t) / 60, 2)))

From the above TSNE visualization of the Word2Vec embeddings, several clusters can be distinguished in which semantic themes are recognizable, for instance *medical treatment, vaccine research, epidemiological research, covid-19 detection, transmission, causes and consequences of the disease*. Inter-word distance in the 2D plane is an indication of inter-word similarity.