In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
import torch
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.layers import Input, Lambda, Dense
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
from tqdm import tqdm_notebook

import helper_functions as hf
import meta_cleaning as mc
import eda_text as et
import config 
import viz_plot as vp
import word_cloud_prep as wcp
import covid_clustering as  cc

#import biobert_embedding as be
#import spacy
import matplotlib.pylab as plt
import plotly.express as px
from collections import defaultdict
from timeit import default_timer as timer
from IPython.display import Image
#import tabulate

#spacytokenizer = spacy.tokenizer.Tokenizer(be.nlp.vocab)

# Any results you write to the current directory are saved as output.
ROOTDIR = "../input"
DATADIR = os.path.join(ROOTDIR, 'CORD-19-research-challenge')

In [None]:
Image("../input/covid-image/cov.png")

# Introduction

> **"Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less."**
--Marie Curie

In the midst of this crisis, we are a team of data scientists who would like to put our skills to good use by giving the scientific community an NLP tool that helps answer key questions from the scientific literature.

The tool we are building here is a search engine based on similarity scores computed on embeddings from the [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175) (USE) model, whose pretrained weights are available on TensorFlow Hub via this [link](https://tfhub.dev/google/universal-sentence-encoder/4). We also cluster all the articles so that users can gain insights by navigating between similar articles.

We have tried to write this kernel more like an article. If you want to get your hands dirty, feel free to have a look at the enclosed utility scripts.

# Parsing
We would first like to include information from the JSON files to obtain a complete dataset. The following steps parse documents from the Biorxiv, comm use and non-comm use subsets, as organized in the CORD-19-research-challenge dataset.

In [None]:
df_meta = pd.read_csv(os.path.join(DATADIR, "metadata.csv"))

**Parsing Biorxiv articles**

In [None]:
biorxiv_path = os.path.join(DATADIR, "biorxiv_medrxiv/biorxiv_medrxiv/pdf_json/")
df_biorxiv = mc.parse_biorxiv(biorxiv_path)

**Parsing Comm use subset articles**

In [None]:
comm_subset_path = os.path.join(
    DATADIR, "comm_use_subset/comm_use_subset/pdf_json/"
)
df_comm = mc.parse_comm(comm_subset_path)


**Parsing Non Comm use subset articles**

In [None]:
noncomm_subset_path = os.path.join(
    DATADIR, "noncomm_use_subset/noncomm_use_subset/pdf_json/"
)
df_noncomm = mc.parse_noncomm(noncomm_subset_path)

**Merge dataset**

We proceed by merging all the datasets on the unique key "sha", and check whether there are duplicates in our dataset.
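The internals of `mc.merge_datasets` and `mc.drop_duplicates` live in the utility scripts and are not shown here. As a rough illustration only, a merge on "sha" followed by a duplicate check could look like the sketch below. The toy frames and every name ending in `_toy` are made up for the example.

```python
import pandas as pd

# Toy stand-ins for metadata.csv and one parsed JSON subset (all values are made up).
df_meta_toy = pd.DataFrame(
    {
        "sha": ["a1", "b2", "b2", None],
        "title": ["Paper A", "Paper B", "Paper B", "Paper C"],
    }
)
df_json_toy = pd.DataFrame(
    {
        "sha": ["a1", "b2"],
        "text": ["full text of A", "full text of B"],
    }
)

# A left join keeps every metadata row, including those without a parsed JSON file.
df_merged_toy = df_meta_toy.merge(df_json_toy, on="sha", how="left")

# Rows sharing a non-null sha are duplicates of the same article.
has_sha = df_merged_toy["sha"].notna()
is_dup = has_sha & df_merged_toy.duplicated(subset="sha", keep="first")
print("duplicated rows:", int(is_dup.sum()))

df_dedup_toy = df_merged_toy[~is_dup].reset_index(drop=True)
print(df_dedup_toy)
```

Rows without a sha are kept on purpose in this sketch: they cannot be matched to a JSON file, but their metadata (title, abstract) may still be useful.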

In [None]:
df_meta.columns

In [None]:
# Convert publish_time to datetime and store it as publish_date
df_meta["publish_date"] = pd.to_datetime(df_meta["publish_time"])
df_merge = mc.merge_datasets(df_meta, df_biorxiv, df_comm, df_noncomm)
df_merge_impute = mc.impute_columns(df_merge)
df_meta_comp = mc.drop_duplicates(df_merge_impute)
df_meta_comp.index.name ="row_id"
df_meta_comp.to_csv("df_meta_comp.csv")
del df_meta, df_merge_impute, df_comm, df_noncomm, df_biorxiv

A peek at the completed meta dataset.

In [None]:
# Only the last expression of a cell is displayed, so print the shape explicitly
print(df_meta_comp.shape)
df_meta_comp.head()

# EDA
## Understanding variables 
Before working on the search engine, we explore the data to get a sense of what awaits us.

We first look at the missing values. This helps us understand the variables, and subsequently choose the ones to work with.

In [None]:
mc.plot_missing_value_barchart(df_meta_comp)

From the above count plot, we can identify the meaning of each variable as follows (most of them are self-explanatory):

* ID numbers:
    * cord_uid,
    * doi,
    * pubmed_id (has been excluded),
    * pmcid (has been excluded),
    * WHO #Covidence (has been excluded),
    * Microsoft Academic Paper ID (has been excluded),
    * sha (the only one needed to retrieve data from the JSON files)
* url: link to the article
* source_x: source of the article
* has_full_text: (boolean) indicate whether the article has full text
* journal
* authors_list
* first_author: the first author of the article
* last_author: the author in the last position, normally the head of the research institute
* affiliations
* bibliography
* raw_bibliography: bibliography in its raw form
* title
* abstract
* text
* publish_date
* full_text_file: path to the json file (has been excluded)

Many of them are IDs; here we are only interested in sha, which acts as a unique key to retrieve the articles. Other useful variables are url (for documentation purposes), source_x, journal, authors, title, abstract and publish_time.

Note that several articles (15 of them) do not have a title. Strictly speaking, they should be excluded from this dataset as they are not informative.

## On the authors

In [None]:
df_authors = df_meta_comp["authors"].apply(mc.author_feats)
mc.plot_num_author_distrib(df_authors["num_authors"])
del df_authors

Nowadays, an article is often a contribution of several people, most likely between 2 and 30.

## On the sources

In [None]:
mc.plot_article_sources_distrib(df_meta_comp)

We observe many contributions from PubMed (PMC) and Elsevier, and far fewer from the rest; this might depend on how well established these sources are. Here is a summary of these sources.
- **PubMed Central® (PMC)**: free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine; about 5.9 million articles are archived in PMC.
- **Elsevier**: Dutch publishing and analytics company specializing in scientific, technical, and medical content.
- **WHO**: World Health Organization
- **biorxiv**: the preprint server for biology.
- **medrxiv**: the preprint server for medical and health sciences
- **CZI**: Chan Zuckerberg Initiative

## On the publish date

In [None]:
df_publish_date = mc.groupby_publish_date(df_meta_comp)
mc.plot_publish_date_distrib(df_publish_date)
del df_publish_date

We observe spikes of publication at the end of the year. Do they only concern certain sources?

Looking at this time plot, we strongly suspect a dependence between the publish date and the sources; let's have a look to check this hypothesis. We only consider articles from 2002 onwards, as earlier articles are much less frequent.

In [None]:
df_date_source = mc.gropuby_date_source(df_meta_comp)
mc.plot_publish_date_wrt_sources(df_date_source)
del df_date_source

While some sources, e.g. WHO, medrxiv and CZI, only have fairly recent publications, PubMed and Elsevier have contributions since 2002, and they are the ones with spikes of publication at the end of the year.

Publications prior to 2002 are mostly about ILI (influenza-like illnesses), as the SARS virus was not yet known.

In the coming days/weeks, we expect an avalanche of papers submitted to biorxiv and medrxiv, as they do not require peer review before posting. This implies that the search engine to be developed has to be scalable.

## EDA on text data
**On the title**

We perform a light EDA on the text data. The NLP preprocessing pipeline includes the following steps:
- lowercase
- remove punctuation
- remove URLs (https)
- remove stopwords
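The actual implementation is `et.nlp_preprocess` in the utility scripts. The following is only a minimal sketch of such a pipeline using the standard library, assuming a tiny hand-written stopword list; the names `preprocess`, `STOPWORDS`, `URL_PATTERN` and `PUNCT_TABLE` are hypothetical.

```python
import re
import string

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "with"}

URL_PATTERN = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(text):
    """Lowercase, strip URLs and punctuation, then drop stopwords."""
    text = text.lower()
    # URLs are removed before punctuation so that "https://..." is still recognizable.
    text = URL_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = preprocess("Outbreak of the Novel Coronavirus in Wuhan: see https://example.org/report")
print(tokens)  # ['outbreak', 'novel', 'coronavirus', 'wuhan', 'see']
```

In this sketch the URL removal runs before the punctuation removal; the other way round, the URL would be split into fragments that are harder to filter out.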

In [None]:
df_preprocess_title = df_meta_comp["title"].apply(
        lambda x: et.nlp_preprocess(x, config.PUNCT_DICT)
    )
    
df_wc_title = et.corpora_freq(df_preprocess_title)
et.plot_distrib(df_wc_title, "title");
del df_preprocess_title

We observe that the top words are dominated by vocabulary related to influenza-like illnesses and COVID-19, e.g. virus, respiratory, coronavirus. The vocabulary also contains research terms such as clinical, study, analysis. There is no doubt that we are dealing with COVID-19 literature.

**On the abstract**

In [None]:
df_preprocess_abstract = df_meta_comp["abstract"].apply(
        lambda x: et.nlp_preprocess(x, config.PUNCT_DICT)
    )
df_wc_abstract = et.corpora_freq(df_preprocess_abstract)
et.plot_distrib(df_wc_abstract, "abstract");

del df_preprocess_abstract

The word count plot above shows a richer vocabulary that is not always specific to the medical research literature at hand, e.g. also, may, can, however, not, abstract... If we were handling the corpus with a bag-of-words approach, these words should be removed. This plot helps us shape the strategy for the following studies.

**On the text**

In [None]:
df_preprocess_text = df_meta_comp["text"].apply(
        lambda x: et.nlp_preprocess_text(x, config.PUNCT_DICT)
)
df_wc_text = et.corpora_freq(df_preprocess_text)
et.plot_distrib(df_wc_text, "text");
del df_preprocess_text

While the abstracts already contain some uninformative words, this type of vocabulary is even more present in the full text.

## On the affiliation

In [None]:
df_process_affiliation = df_meta_comp["affiliations"].apply(et.process_affiliations)

df_wc_affiliation = et.corpora_freq(df_process_affiliation, affiliation=True)
ax = et.plot_distrib(df_wc_affiliation.iloc[1:], "affiliation")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
del df_process_affiliation

The top 10 institutions that have contributed to this dataset are mostly Chinese institutions. This makes sense, as China is where the first wave of COVID-19 took place. Overall, however, the contribution to this dataset is a global effort.

# Embedding
In this section, we compute embeddings of our text data. The idea of embedding is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The embedded data also have a lower dimension than with classical methods such as count or tf-idf vectorization. Once we have the embeddings, we use them as latent variables for further processing, e.g. clustering, similarity computation with embedded queries... Here, we focus on these two tasks.

Now comes the question of which embedding to use. We have previously written a kernel applying [BioBERT embeddings](https://www.kaggle.com/pmlee2017/biobert-embedding/); however, without fine-tuning, the performance was not what we expected. In this kernel, we use USE embeddings and compare using only the title with using both title and abstract.

## Universal Sentence Encoder in a nutshell

There are some key elements a reader ought to know when using USE embeddings.
- The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs.
- It is trained on a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks, such as semantic similarity and classification.
- The input is variable-length English text and the output is a 512-dimensional vector.
- The USE model is trained with a deep averaging network (DAN) encoder.

### Semantic Similarity
The task at hand is to find similar documents by evaluating semantic similarity. Semantic similarity is a measure of the degree to which two pieces of text carry the same meaning. As shown in the following image, the text of a document is first embedded by a model. Then, by computing similarity scores across the embeddings of all documents in the corpus, we can use these scores to rank the documents.
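To make the ranking step concrete, here is a small sketch with NumPy. It assumes cosine similarity as the score and uses made-up 3-dimensional vectors instead of real USE embeddings; `rank_by_cosine` is a hypothetical helper, not part of the utility scripts.

```python
import numpy as np


def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (indices, scores) of the top_k documents most similar to the query."""
    # Normalize to unit length so that the inner product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]


# Made-up 3-dimensional "embeddings"; real USE vectors have 512 dimensions.
docs = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.7, 0.7, 0.0],
        [0.0, 0.0, 1.0],
    ]
)
query = np.array([1.0, 0.2, 0.0])

top_idx, top_scores = rank_by_cosine(query, docs)
print(top_idx, np.round(top_scores, 3))
```

Document 0 points almost in the same direction as the query and comes first, followed by document 2, which shares part of that direction.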




In [None]:
Image("../input/semantic-similarity/Capture decran 2020-04-16 a 16.10.28.png")

## Load USE model
The USE model can be obtained from this [TensorFlow Hub](https://tfhub.dev/google/universal-sentence-encoder/4) page. It takes about a minute to load the model, and inference is relatively fast on ~50000 articles.


In [None]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.load(module_url)

def UniversalEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)))

input_text = Input(shape=(1,), dtype=tf.string)
embedding = Lambda(UniversalEmbedding, output_shape=(512, ))(input_text)
model = Model(inputs=[input_text], outputs=embedding)

# define some parameters
nsamples = 5000

## USE on title

In this section, we carry out the analysis only on titles.

In [None]:
title_len = df_meta_comp["title"].apply(lambda x: len(x.split()) if not pd.isnull(x) else np.nan)
mean_title_len = np.nanmean(title_len)
print("Mean title length: %.1f words" %mean_title_len)

In [None]:
df_title = df_meta_comp["title"].dropna()

# keep id
id_titles = df_meta_comp[~df_meta_comp.title.isna()].index

In [None]:
embeddings_titles = model.predict(df_title, batch_size=1024, verbose=1)
df_embedding_title = pd.DataFrame(embeddings_titles, index = id_titles)
df_embedding_title.index.name = "row_id"
embeddings_titles = np.concatenate((id_titles.values.reshape(-1, 1), embeddings_titles), axis=1)
np.save('embeddings_titles.npy', embeddings_titles)
del embeddings_titles


### Clustering
With the embeddings done, we can proceed with clustering. We use MiniBatch KMeans to cluster our datapoints, which again fits our aim of performing all computations in batches. Then we perform a PCA projection in 3D.
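The clustering itself is wrapped in `cc.miniBatchClustering`, whose internals are not shown in this kernel. To illustrate the general idea of mini-batch k-means, here is a simplified NumPy sketch on toy 2D data; the function `mini_batch_kmeans` and its parameters are assumptions made for the example, not the code actually used.

```python
import numpy as np


def mini_batch_kmeans(X, k, batch_size=32, n_iter=100, seed=0):
    """Mini-batch k-means: update centers from small random batches of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch point to its nearest center.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Move each center toward its points with a per-center step of 1 / count.
        for x, c in zip(batch, nearest):
            counts[c] += 1
            centers[c] += (x - centers[c]) / counts[c]
    # Final assignment of the whole dataset.
    full_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, full_dists.argmin(axis=1)


# Two well-separated toy blobs standing in for the embeddings.
data_rng = np.random.default_rng(42)
blob_a = data_rng.normal(loc=0.0, scale=0.5, size=(100, 2))
blob_b = data_rng.normal(loc=10.0, scale=0.5, size=(100, 2))
X_toy = np.vstack([blob_a, blob_b])

centers_toy, labels_toy = mini_batch_kmeans(X_toy, k=2)
print(np.round(centers_toy, 1))
```

Only one batch is held in memory per update, which is why this family of algorithms suits a corpus that keeps growing.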

In [None]:
n_clusters = np.arange(4, 32, 4)
df_clusters = cc.miniBatchClustering(df_embedding_title, None, n_clusters)

To determine the optimal number of clusters, we use the Silhouette score, which measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). This metric takes values between -1 and 1, with -1 meaning the datapoint does not belong to its assigned cluster at all, and 1 indicating good clustering.
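For readers who want to see the definition at work, the sketch below computes the Silhouette value s = (b - a) / max(a, b) of each sample by hand, where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to another cluster. It is a didactic NumPy version on four made-up points, not the routine behind `cc.plot_silhouette_graph`.

```python
import numpy as np


def silhouette_values(X, labels):
    """Silhouette value s = (b - a) / max(a, b) for every sample."""
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # singleton cluster: the value is left at 0
        a = dists[i, same].mean()  # cohesion: mean distance within own cluster
        # Separation: smallest mean distance to the members of another cluster.
        others = [c for c in np.unique(labels) if c != labels[i]]
        b = min(dists[i, labels == c].mean() for c in others)
        scores[i] = (b - a) / max(a, b)
    return scores


X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_scores = silhouette_values(X_toy, [0, 0, 1, 1])  # sensible clusters
bad_scores = silhouette_values(X_toy, [0, 1, 0, 1])   # clusters mixed up
print(np.round(good_scores.mean(), 3), np.round(bad_scores.mean(), 3))
```

The sensible labelling gives values close to 1, while the mixed-up labelling gives negative values, which is the behaviour described above.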

In [None]:
import matplotlib.pylab as plt
fig, axes = plt.subplots(3, 3, figsize=(12, 16))

sample_title = df_embedding_title.sample(nsamples, random_state=42)
idx_title = sample_title.index
cc.plot_silhouette_graph(
        sample_title,
        df_clusters.loc[idx_title],
        n_clusters,
        fig,
        axes,
    )

The average Silhouette score is rather low, but this score should not be a hard and fast rule for determining the optimal number of clusters. Looking closely at the Silhouette plot, we observe that many datapoints are wrongly assigned, but their Silhouette scores lie between -0.1 and 0, meaning they are not that badly assigned. On the other hand, the positive Silhouette scores are not that high either.

For a low number of clusters (4-8), cluster sizes are rather uneven, and the clusters potentially need to be divided into sub-clusters. Hence a higher number of clusters is preferred.

The fact that many datapoints are assigned to the wrong cluster might indicate that many articles are cross-disciplinary, and therefore do not belong to a single cluster.

From this analysis, we have seen that the Silhouette graph is not very conclusive. We therefore turn to another metric, the Calinski-Harabasz score, defined as the ratio of the between-cluster dispersion to the within-cluster dispersion (the higher, the better).
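As a worked example, the Calinski-Harabasz score can be written as [B / (k - 1)] / [W / (n - k)], where B is the between-cluster dispersion, W the within-cluster dispersion, k the number of clusters and n the number of samples. The NumPy sketch below applies this formula to four made-up points; `calinski_harabasz` is a hypothetical helper and not the code behind `cc.evaluate_metric_score`.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion, scaled by degrees of freedom."""
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        # Between: squared distance of the cluster center to the global mean, weighted by size.
        between += len(members) * np.sum((center - overall_mean) ** 2)
        # Within: squared distances of the members to their own center.
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))


X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
ch_good = calinski_harabasz(X_toy, [0, 0, 1, 1])  # compact, well-separated clusters
ch_bad = calinski_harabasz(X_toy, [0, 1, 0, 1])   # clusters mixed up
print(ch_good, ch_bad)  # 50.0 0.08
```

Compact and well-separated clusters give a high score, so a higher value indicates a better clustering.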


In [None]:
df_calinski = cc.evaluate_metric_score(df_embedding_title, df_clusters, metric="calinski")
df_calinski["n_cluster"] = n_clusters

In [None]:
ax = cc.plot_metric_vs_cluster(df_calinski)
x = df_calinski['n_cluster'].values
y = df_calinski['metric'].values

a1 = (y[1]-y[0])/(x[1]-x[0])
b1 =1270

def f(x, a, b):
    return a*x+b

x_asympt= np.arange(4., 13.)
y1 = f(x_asympt, a1, b1)

ax.axhline(300,  color="black", linestyle="dashed", linewidth=1)
ax.axvline(11, ymax=.6, color="red", linestyle="dashed", linewidth=1)

ax.plot(x_asympt, y1, color="black", linestyle="dashed", linewidth=1)
ax.set_ylabel("Calinski-Harabasz score");


By plotting the Calinski-Harabasz score against the number of clusters, we observe an elbow. Although this is not a very precise way of determining the optimal number of clusters, it at least gives us some indication. The optimal number of clusters determined here is 11.
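In this kernel the elbow is read off graphically from hand-placed lines. As a possible automated alternative (not the method used here), one common heuristic picks the point farthest from the straight line joining the two ends of the curve. The sketch below uses made-up scores, and `elbow_index` is a hypothetical helper.

```python
import numpy as np


def elbow_index(x, y):
    """Index of the point farthest from the straight line joining the curve's endpoints."""
    # Rescale both axes to [0, 1] so that neither dominates the distance.
    xs = (x - x.min()) / (x.max() - x.min())
    ys = (y - y.min()) / (y.max() - y.min())
    start = np.array([xs[0], ys[0]])
    direction = np.array([xs[-1], ys[-1]]) - start
    direction = direction / np.linalg.norm(direction)
    rel = np.column_stack([xs, ys]) - start
    # Perpendicular distance = norm of the component orthogonal to the chord.
    proj = np.outer(rel @ direction, direction)
    return int(np.argmax(np.linalg.norm(rel - proj, axis=1)))


# Made-up scores that drop quickly and then flatten out (not the notebook's values).
n_clusters_toy = np.array([4, 8, 12, 16, 20, 24, 28])
scores_toy = np.array([1000.0, 600.0, 400.0, 350.0, 320.0, 305.0, 300.0])

best = n_clusters_toy[elbow_index(n_clusters_toy, scores_toy)]
print(best)  # 12
```

Such a heuristic can only return a value from the evaluated grid, so it complements a visual reading of the curve rather than replacing it.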


### Visualization
Another way to check whether our datapoints are correctly clustered is to visualize them in lower dimensions. We perform a PCA with 3 components to obtain a 3D projection. Before the PCA projection, we sphereize our datapoints (the embeddings) by shifting each datapoint by the centroid and scaling it to unit norm. This distributes the datapoints better in the spherical space.
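The two steps described above can be sketched in a few lines of NumPy. This is our reading of the description, using random data in place of the embeddings; the functions below are illustrative stand-ins and may differ from `vp.sphereize` and `vp.compute_pca` in the utility scripts.

```python
import numpy as np


def sphereize(X):
    """Center the points on their centroid, then scale each one to unit norm."""
    centered = X - X.mean(axis=0)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    # Guard against division by zero for a point lying exactly on the centroid.
    return centered / np.where(norms == 0, 1.0, norms)


def pca_project(X, n_components=3):
    """Project X onto its first principal components using an SVD."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 16))  # stand-in for the 512-dimensional embeddings
X_sphere = sphereize(X_toy)
X_3d = pca_project(X_sphere)
print(X_3d.shape)  # (200, 3)
```

After sphereizing, every point lies on the unit sphere, so the projection reflects differences in direction rather than in vector length.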

**Give the 3D plot a spin, and hover over the datapoints to get some information.**

In [None]:
X = vp.sphereize(df_embedding_title.values)
index = df_embedding_title.index
colors = px.colors.qualitative.Alphabet + px.colors.qualitative.Plotly
df_pca = vp.compute_pca(X, index)
df_resampled = vp.resample(df_pca, df_meta_comp, df_clusters, colors, nsamples=nsamples, n_cluster=24)
info_title = vp.prepare_info(df_resampled)
vp.plot_tsne(df_resampled, "X_0", "X_1", info_title, var_z="X_2")

We observe that the clustering is rather coherent. To evaluate the relevance of the clusters, our next step is to seek expert opinion.

### Wordcloud
We further plot a wordcloud for each cluster to help with the interpretation. Since the vocabulary of USE consists of wordpieces, it does not make much sense to use it for our wordcloud. Therefore, we proceed with a conventional NLP preprocessing pipeline: lemmatization and tokenization, stopword removal, punctuation removal, and count vectorization.

We first preprocess the whole corpus to identify the top words, and then treat them as stopwords.
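The idea of promoting the most frequent corpus words to stopwords can be sketched with `collections.Counter`. The titles below are made up and already preprocessed, and the variable names are hypothetical; the kernel itself relies on `wcp.prepare_word_cloud` for this step.

```python
from collections import Counter

# Made-up preprocessed titles; each string is already lowercased and cleaned.
titles_toy = [
    "virus infection in patients",
    "respiratory virus detection",
    "virus genome analysis",
    "respiratory infection outbreak",
    "infection control measures",
]

word_counts = Counter(word for title in titles_toy for word in title.split())

# Treat the n most frequent words of the whole corpus as extra stopwords.
n_top_words = 2
extra_stopwords_toy = {word for word, _ in word_counts.most_common(n_top_words)}

filtered = [
    [word for word in title.split() if word not in extra_stopwords_toy]
    for title in titles_toy
]
print(sorted(extra_stopwords_toy))  # ['infection', 'virus']
print(filtered[1])                  # ['respiratory', 'detection']
```

Words that appear in almost every cluster carry little information about any single one, so removing them lets the cluster-specific vocabulary stand out in the wordclouds.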

In [None]:
df_title_cluster = df_clusters.merge(
        df_title, left_index=True, right_index=True
    )

In [None]:
df_title_cluster["title_process"] = df_title_cluster["title"].apply(
        lambda x: et.nlp_preprocess(str(x), config.PUNCT_DICT)
        if not pd.isnull(x) else "")

df_title_wc = wcp.prepare_word_cloud(df_title_cluster["title_process"])
# get rid of top 10 words
extra_stopwords = df_title_wc.head(10).keys().tolist()

extra_stopwords += ["covid", "sars", "infectious", "19", "volume", "index", "chapter", "volume", "1", "de", "la"]


Now we loop over all clusters and preprocess the text in order to give coherent vocabularies for the wordclouds.


In [None]:
wc_title = defaultdict(int)
ncluster = 12
nclusters = sorted(
        df_title_cluster["labels_cluster_%d" % ncluster].unique().tolist()
    )

for k in nclusters:
    temp = df_title_cluster[df_title_cluster["labels_cluster_%d" % ncluster] == k][
        "title_process"
    ]

    try:
        wc_title[k] = wcp.prepare_word_cloud(temp, extra_stopwords)
    except ValueError:
        pass

n_top = 50
fig, axes = plt.subplots(4, 3, figsize=(16, 16))
ax = axes.flatten()
wcp.plot_word_cloud(wc_title, ax)
plt.tight_layout(w_pad=2.0)
del wc_title, df_title_cluster, df_title_wc

Above each word cloud are its top 3 words by occurrence. The topics are quite distinct, considering that we have only used the title as input. Among the topics, we have syndromes, detection and analysis, genetics (rna), receptor and inhibitor...


## USE on title and abstract
For this task, we concatenate title and abstract. The procedure resembles that of the embedding on title.
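A detail worth knowing about this concatenation in pandas: a missing abstract can be replaced by an empty string, but a missing title propagates as NaN through the `+` operator, so the corresponding row disappears after `dropna()`. The toy frame below (made-up values) illustrates this behaviour.

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", np.nan],
        "abstract": ["Abstract of A.", np.nan, "Abstract without a title."],
    }
)

# A missing abstract becomes an empty string, so the title alone is kept.
# A missing title stays NaN, and NaN + str is NaN, so that row is dropped below.
combined = df_toy["title"] + "\n" + df_toy["abstract"].fillna("")
combined = combined.dropna()
print(combined.tolist())  # ['Paper A\nAbstract of A.', 'Paper B\n']
```

Articles without a title are therefore excluded from the title+abstract embeddings, just as they are from the title-only embeddings.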

In [None]:
df_title_abstract = df_meta_comp['title'] + "\n" + df_meta_comp['abstract'].fillna('') 
df_title_abstract = df_title_abstract.dropna()
id_title_abstract = df_title_abstract[~df_title_abstract.isna()].index

In [None]:
title_abstract_len = df_title_abstract.apply(lambda x: len(x.split()) if not pd.isnull(x) else np.nan)
mean_title_abstract_len = np.nanmean(title_abstract_len)
print("Mean title abstract length: %.1f words" %mean_title_abstract_len)

In [None]:
embeddings_title_abstract = model.predict(df_title_abstract, batch_size=1024, verbose=1)
df_embedding_title_abstract = pd.DataFrame(embeddings_title_abstract, index = id_title_abstract)
df_embedding_title_abstract.index.name = "row_id"
embeddings_title_abstract = np.concatenate((id_title_abstract.values.reshape(-1, 1), embeddings_title_abstract), axis=1)
np.save('embeddings_title_abstract.npy', embeddings_title_abstract)
del embeddings_title_abstract

In [None]:
n_clusters = np.arange(4, 30, 4)
df_clusters_title_abstract = cc.miniBatchClustering(df_embedding_title_abstract, None, n_clusters)

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(12, 16))

sample_title_abstract = df_embedding_title_abstract.sample(nsamples, random_state=42)
idx_title_abstract = sample_title_abstract.index
cc.plot_silhouette_graph(
        sample_title_abstract,
        df_clusters_title_abstract.loc[idx_title_abstract],
        n_clusters,
        fig,
        axes,
    )

These Silhouette plots look a little different from those using only the title: the values are higher and there seem to be fewer errors in cluster assignment. However, this analysis still does not allow us to determine the optimal number of clusters. Let's have a look at the Calinski-Harabasz score.

In [None]:
df_calinski_title_abstract = cc.evaluate_metric_score(df_embedding_title_abstract, df_clusters_title_abstract, metric="calinski")
df_calinski_title_abstract["n_cluster"] = n_clusters

In [None]:

ax = cc.plot_metric_vs_cluster(df_calinski_title_abstract)
x = df_calinski_title_abstract['n_cluster'].values
y = df_calinski_title_abstract['metric'].values

a1 = (y[1]-y[0])/(x[1]-x[0])
b1 = 2350

def f(x, a, b):
    return a*x+b

x_asympt= np.arange(2., 14.)
y1 = f(x_asympt, a1, b1)

ax.axhline(500,  color="black", linestyle="dashed", linewidth=1)
ax.axvline(12, ymax=.5, color="red", linestyle="dashed", linewidth=1)
ax.plot(x_asympt, y1, color="black", linestyle="dashed", linewidth=1)

ax.set_ylabel("Calinski-Harabasz score");

### Visualization 
Projection with PCA.

In [None]:
X_title_abstract = vp.sphereize(df_embedding_title_abstract.values)
index_title_abstract = df_embedding_title_abstract.index
colors = px.colors.qualitative.Alphabet
df_pca_title_abstract = vp.compute_pca(X_title_abstract, index_title_abstract)
df_resampled_title_abstract = vp.resample(df_pca_title_abstract, df_meta_comp, 
                                          df_clusters_title_abstract, colors, nsamples=nsamples, n_cluster=12 )
info_title_abstract = vp.prepare_info(df_resampled_title_abstract)
vp.plot_tsne(df_resampled_title_abstract, "X_0", "X_1", info_title_abstract, var_z="X_2")

From this visualization, we observe more distinct clusters than with the title only. This indicates that including the abstract adds information and thus leads to a better clustering.

### Wordcloud
We plot the wordcloud for each cluster.

In [None]:
df_title_abstract.name = "title_abstract"
df_title_abstract_cluster = df_clusters_title_abstract.merge(
        df_title_abstract, left_index=True, right_index=True
    )
df_title_abstract_cluster["title_abstract_process"] = df_title_abstract_cluster["title_abstract"].apply(
        lambda x: et.nlp_preprocess(str(x), config.PUNCT_DICT)
        if not pd.isnull(x) else "")

In [None]:
df_title_abstract_wc = wcp.prepare_word_cloud(df_title_abstract_cluster["title_abstract_process"])
# since there are more words, we consider the top 20 words as stop words
extra_stopwords_title_abstract = df_title_abstract_wc.head(20).keys().tolist()

extra_stopwords_title_abstract += ["covid", "sars", "infectious", "19", "may", "can", "volume", "index", "chapter", "volume", "used", "also", "de", "la"]

In [None]:
wc_title_abstract = defaultdict(int)
ncluster = 12
nclusters = sorted(
        df_title_abstract_cluster["labels_cluster_%d" % ncluster].unique().tolist()
    )

for k in nclusters:
    temp = df_title_abstract_cluster[df_title_abstract_cluster["labels_cluster_%d" % ncluster] == k][
        "title_abstract_process"
    ]

    try:
        wc_title_abstract[k] = wcp.prepare_word_cloud(temp, extra_stopwords_title_abstract)
    except ValueError:
        pass

n_top = 50
fig, axes = plt.subplots(4, 3, figsize=(16, 16))
ax = axes.flatten()
wcp.plot_word_cloud(wc_title_abstract, ax)
plt.tight_layout(w_pad=2.0)
del wc_title_abstract, df_title_abstract_cluster, df_title_abstract_wc 

Compared to the previous study using only the title, we observe that the vocabulary is richer and the topics are more distinct. We can easily identify some coronavirus (Covid-19) related topics such as detection and samples, vaccine development, genes (sequence, replication, genome structure and method), the outbreak in China, the public care system, data and modeling, treatments and effects...

# Demo
To give an idea of the difference between the two studies, we have prepared a demo using ipywidgets. By default, the query is compared with the embeddings from title only.

In [None]:
import ipywidgets as widgets
from ipywidgets import interact, Layout

In [None]:
text_area_layout = Layout(width="70%", height="50px")
text_area = widgets.Textarea(value="Incubation periods", placeholder="Enter your text here.", layout=text_area_layout)

int_slider_layout = Layout(width="50%")
int_slider = widgets.IntSlider(description="Select number of results to show",
                               min=1, 
                               max=40, 
                               value=10, 
                               layout=int_slider_layout,
                               style={'description_width': 'initial'}
                              )

radio_buttons_layout = Layout(width="50%")
radio_buttons = widgets.RadioButtons(description="select embeddings", 
                                     value='title', 
                                     options=['title', 'title+abstract'],
                                     style={'description_width': 'initial'},
                                     layout=radio_buttons_layout
                                    )

toggle_button = widgets.ToggleButton(value=True)

checkbox = widgets.Checkbox(value=False, description='Show abstracts', disabled=False, indent=False)

In [None]:
@interact
def plot_search_results(emb=radio_buttons, n=int_slider, show_abstracts=checkbox, text=text_area):
    if text.strip() != '':
        if emb == "title": 
            embs = df_embedding_title
        elif emb == "title+abstract":
            embs = df_embedding_title_abstract
      
        print(f"Displaying {n} most similar results for \n{text} ...\n")
        
        embedding_text = embed([text])
        embedding_text = embedding_text.numpy()
        similarities = np.inner(embedding_text, embs.values)

        indices = np.argsort(similarities)[0]
        indices = indices[::-1][:n]

        row_ids = embs.iloc[indices].index
        row_ids = list(map(int, row_ids))

        for i, (row_id, index) in enumerate(zip(row_ids, indices)):

            title = df_meta_comp.loc[row_id]['title']
            abstract = df_meta_comp.loc[row_id]['abstract']
            print(f'result {i} title : {title}')
            print(f'similarity : {similarities[0][index]}')
            
            if show_abstracts:
                print('')
                print(f'result {i} abstract : {abstract}')

            print('----' )
    else:
        print('no query, no results baby.')


By testing some queries, we observe that the first five results are rather relevant. Their similarity scores, which act as a measure of confidence, are very often capped at around 0.65. This might be because the model is trained on a general text corpus that is not specific to scientific articles, and even less to articles related to Covid-19.

# Conclusions
In this kernel, we have used USE embeddings to embed our text data and to compute the similarity with a query entered by a user. We have clustered the datapoints and used PCA to reduce the number of dimensions and visualize them in three dimensions. With word clouds, we have been able to get an idea of the topic of each cluster. Finally, we have built a demo to test two settings: embeddings on title only, and embeddings on both title and abstract.

We have observed some differences when comparing the results returned with title only and with both title and abstract. From the point of view of a search engine, users often only use a few keywords as queries, hence it is more reasonable to compare them with embeddings inferred from the title only, because titles are also short and precise. On the other hand, the embeddings with title and abstract have proven useful for clustering because they contain more information.

The takeaway message here is that embeddings with title are useful for query searching, and embeddings with title and abstract are useful for clustering.

# Next steps

This study shows that USE embeddings can recommend relevant articles for the input queries, but there is still room for improvement, as the confidence of the model given by the similarity score is not exceptional. We will proceed by fine-tuning the USE model on the CORD19 research challenge corpus and, as a comparison, we will carry out the same task with the BioBERT model. Stay tuned for more 😀.

Here is a list of entries to the CORD19 research challenge that use this method to extract information:
* [What is known about transmission, incubation, and environmental stability?](https://www.kaggle.com/pmlee2017/use-semantic-similarity-transmission-incubation/) 
* [What do we know about virus genetics, origin, and evolution? ](https://www.kaggle.com/pmlee2017/use-semantic-similarity-virus-genetics)  
* [What do we know about COVID-19 risk factors?](https://www.kaggle.com/pmlee2017/use-semantic-similarity-risk-factors/)
* [What do we know about vaccines and therapeutics? What has been published concerning research and development and evaluation efforts of vaccines and therapeutics?](https://www.kaggle.com/pmlee2017/use-semantic-similarity-therapeutics/)
* [What has been published about medical care?](https://www.kaggle.com/pmlee2017/use-semantic-similarity-medical-care)
* [What do we know about non-pharmaceutical interventions?](https://www.kaggle.com/pmlee2017/use-semantic-similarity-non-pharma-interv)
* [What do we know about diagnostics and surveillance?](https://www.kaggle.com/pmlee2017/use-semantic-similarity-diagnostics-surveillance) 
* [What has been published about ethical and social science considerations?](https://www.kaggle.com/pmlee2017/use-semantic-similarity-ethics) 
* [What has been published about information sharing and inter-sectoral collaboration?](https://www.kaggle.com/pmlee2017/use-semantic-similarity-sharing-collaboration) 

Enjoy the read! 😃

# References
* [Ipywidgets](https://ipywidgets.readthedocs.io/en/stable/)
* [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175)
* [Colab for USE embeddings](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb#scrollTo=K_3uevjRUgpo)