# Introduction

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import torch
import helper_functions as hf
import meta_cleaning as mc
import eda_text as et
import config 
import viz_plot as vp
import word_cloud_prep as wcp
import covid_clustering as  cc
import biobert_embedding as be
import spacy
import plotly.express as px
from collections import defaultdict
from timeit import default_timer as timer
from IPython.display import HTML, display
import tabulate

spacytokenizer = spacy.tokenizer.Tokenizer(be.nlp.vocab)

# Paths to the input data
ROOTDIR = "../input"
DATADIR = os.path.join(ROOTDIR, 'CORD-19-research-challenge')

> **"Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less."**
--Marie Curie

In the midst of this crisis, we are a team of data scientists who want to put our skills to good use: we offer the scientific community an NLP tool that helps answer key questions from the scientific literature.

The tool we are building here is a search engine that ranks articles by a similarity score computed on BioBERT embeddings ([BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://github.com/dmis-lab/biobert)). We also cluster the articles so that users can gain insight by navigating between similar ones.

We have tried to write a kernel that reads like an article rather than one that drowns in code. If you want to get your hands dirty, feel free to look at the enclosed utility scripts.


# Parsing
We first include information from the JSON files to obtain a complete dataset. The following steps parse documents from the Biorxiv, comm use subset and non-comm use subset folders of the CORD-19-research-challenge dataset.

In [None]:
df_meta = pd.read_csv(os.path.join(DATADIR, "metadata.csv"))

**Parsing Biorxiv articles**

In [None]:
biorxiv_path = os.path.join(DATADIR, "biorxiv_medrxiv/biorxiv_medrxiv/pdf_json/")
df_biorxiv = mc.parse_biorxiv(biorxiv_path)

**Parsing Comm use subset articles**

In [None]:
comm_subset_path = os.path.join(
    DATADIR, "comm_use_subset/comm_use_subset/pdf_json/"
)
df_comm = mc.parse_comm(comm_subset_path)


**Parsing Non Comm use subset articles**

In [None]:
noncomm_subset_path = os.path.join(
    DATADIR, "noncomm_use_subset/noncomm_use_subset/pdf_json/"
)
df_noncomm = mc.parse_noncomm(noncomm_subset_path)

**Merge dataset**

We then merge all the datasets on the unique key "sha" and check whether the result contains duplicates.
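
The `mc.merge_datasets` and `mc.drop_duplicates` helpers live in the utility scripts and are not shown here. As a rough illustration only, a merge on `sha` followed by a duplicate check could look like the sketch below; the toy frames and their values are made up, and the real helpers may behave differently.

```python
import pandas as pd

# Toy stand-ins for the metadata and one parsed JSON subset (values are made up).
meta = pd.DataFrame({
    "sha": ["a1", "b2", "b2", "c3"],
    "title": ["Paper A", "Paper B", "Paper B", "Paper C"],
})
parsed = pd.DataFrame({
    "sha": ["a1", "b2"],
    "text": ["full text A", "full text B"],
})

# A left merge keeps every metadata row, even when no JSON file was parsed for it.
merged = meta.merge(parsed, on="sha", how="left")

# Count rows whose sha already appeared earlier, then keep the first occurrence.
n_duplicates = int(merged.duplicated(subset="sha").sum())
deduplicated = merged.drop_duplicates(subset="sha", keep="first")
print(n_duplicates, deduplicated.shape)
```

Using a left merge is a design choice in this sketch: articles without a parsed full text still keep their metadata, with a missing `text` value.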

In [None]:
df_meta.columns

In [None]:
# Parse publish_time into a datetime column named publish_date
df_meta["publish_date"] =  pd.to_datetime(df_meta["publish_time"])
df_merge = mc.merge_datasets(df_meta, df_biorxiv, df_comm, df_noncomm)
df_merge_impute = mc.impute_columns(df_merge)
df_meta_comp = mc.drop_duplicates(df_merge_impute)

A peek at the completed metadata dataset.

In [None]:
# Only the last expression of a cell is rendered, so display the head explicitly
display(df_meta_comp.head())
df_meta_comp.shape

# EDA
## Understanding variables 
Before working on the search engine, we explore the data to get a sense of what awaits us.

We first look at the missing values. This helps us understand the variables and, subsequently, choose which ones to work with.
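
The plotting helper `mc.plot_missing_value_barchart` is not shown in this kernel. The quantity behind such a chart can be sketched with plain pandas, assuming it is the share of missing values per column; the toy frame below is made up.

```python
import pandas as pd

# Toy frame with some missing values (made-up data).
toy = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "abstract": [None, None, "text", "text"],
    "sha": ["a1", "b2", "c3", "d4"],
})

# Fraction of missing values per column, most incomplete column first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```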

In [None]:
mc.plot_missing_value_barchart(df_meta_comp)

From the count plot above, we can work out the meaning of each variable. Most of them are self-explanatory:

* ID numbers:
    * cord_uid,
    * doi,
    * pubmed_id (has been excluded),
    * pmcid (has been excluded),
    * WHO #Covidence (has been excluded),
    * Microsoft Academic Paper ID (has been excluded),
    * sha (only this is important to retrieve data from jsons)
* url: link to the article
* source_x: source of the article
* has_full_text: (boolean) indicate whether the article has full text
* journal
* authors_list
* first_author: the first author of the article
* last_author: the author in the last position, who normally corresponds to the head of the research institute
* affiliations
* bibliography
* raw_bibliography: the bibliography in its raw form
* title
* abstract
* text
* publish_date
* full_text_file: path to the json file (has been excluded)

Many of them are IDs. Here we are only interested in sha, which acts as a unique key to retrieve the articles. Other useful variables are url (for documentation purposes), source_x, journal, authors, title, abstract and publish_time.

Note that 15 articles have no title. Strictly speaking, they should be excluded from this dataset as they are not informative.

## On the authors

In [None]:
df_authors = df_meta_comp["authors"].apply(mc.author_feats)
mc.plot_num_author_distrib(df_authors["num_authors"])

Nowadays, an article is often the contribution of several people, most likely between 2 and 30.

## On the sources

In [None]:
mc.plot_article_sources_distrib(df_meta_comp)

We observe many contributions from PubMed Central (PMC) and Elsevier and far fewer from the other sources, which might depend on how long each source has been established. Here is a summary of these sources.
- **PubMed Central® (PMC)**: free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM); about 5.9 million articles are archived in PMC.
- **Elsevier**: Dutch publishing and analytics company specializing in scientific, technical, and medical content.
- **WHO**: World Health Organization
- **biorxiv**: the preprint server for biology.
- **medrxiv**: the preprint server for health sciences.
- **CZI**: Chan Zuckerberg Initiative

## On the publish date

In [None]:
df_publish_date = mc.groupby_publish_date(df_meta_comp)
mc.plot_publish_date_distrib(df_publish_date)

We observe that there are spikes of publication at the end of the year. Do they only concern certain journals?

Looking at this time plot, we strongly suspect a dependence between the publish date and the source. Let's have a look to check this hypothesis. We only look at articles from 2002 onwards, as articles before this year are much less frequent.
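
The grouping helper used in the next cell is part of the utility scripts. A hedged sketch of the same idea, counting articles per year and per source from 2002 onwards, is given below; the toy data and the `year` column are assumptions made for illustration.

```python
import pandas as pd

# Toy publication records (made-up data).
toy = pd.DataFrame({
    "publish_date": pd.to_datetime(
        ["1998-05-01", "2003-12-31", "2003-12-31", "2020-03-10", "2020-03-15"]
    ),
    "source_x": ["PMC", "PMC", "Elsevier", "biorxiv", "biorxiv"],
})

# Keep articles from 2002 onwards, then count them per (year, source).
toy["year"] = toy["publish_date"].dt.year
recent = toy[toy["year"] >= 2002]
counts = recent.groupby(["year", "source_x"]).size().unstack(fill_value=0)
print(counts)
```

Each row of `counts` is a year and each column a source, so a source that only appears recently shows zeros in the earlier rows.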

In [None]:
df_date_source = mc.gropuby_date_source(df_meta_comp)
mc.plot_publish_date_wrt_sources(df_date_source)

Several sources, e.g. WHO, medrxiv and CZI, only started contributing articles fairly recently. PubMed and Elsevier, on the other hand, have contributed since 2002, and they are the ones showing publication spikes at the end of the year.

Publications prior to 2002 are mostly about ILI (influenza-like illnesses), as the SARS virus was not yet known.

In the coming days and weeks, we expect an avalanche of papers submitted to biorxiv and medrxiv, as they do not require peer review before posting. This implies that the search engine to be developed has to be scalable.

## EDA on text data
**On the title**

We perform a light EDA on the text data. The NLP preprocessing pipeline includes the following steps:
- lowercase
- remove punctuation
- remove URLs (https links)
- remove stopwords
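
The steps above can be sketched with the standard library alone. This is not the kernel's `et.nlp_preprocess` (which is not shown); the `preprocess` function and the tiny stopword set are hypothetical stand-ins.

```python
import re
import string

# Tiny illustrative stopword list; the kernel's own list is not shown here.
STOPWORDS = {"the", "of", "and", "a", "in", "for", "at"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text):
    """Lowercase, drop links and punctuation, then drop stopwords."""
    text = text.lower()
    # Links are removed before punctuation, otherwise "https://" would be
    # broken apart and no longer match the pattern.
    text = re.sub(r"https?://\S+", " ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

example = preprocess("The Spread of SARS-CoV-2: a Review, see https://example.org/paper")
print(example)
```

Note that the order differs slightly from the list: in this sketch the link removal has to come before the punctuation removal.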

In [None]:
df_preprocess_title = df_meta_comp["title"].apply(
        lambda x: et.nlp_preprocess(x, config.PUNCT_DICT)
    )
    
df_wc_title = et.corpora_freq(df_preprocess_title)
et.plot_distrib(df_wc_title, "title");

We observe that the top words are dominated by vocabulary related to influenza-like illnesses and COVID-19, e.g. virus, respiratory, coronavirus. The vocabulary also contains research terms such as clinical, study, analysis. There is no doubt that we are dealing with COVID-19 literature.

**On the abstract**

In [None]:
df_preprocess_abstract = df_meta_comp["abstract"].apply(
        lambda x: et.nlp_preprocess(x, config.PUNCT_DICT)
    )
df_wc_abstract = et.corpora_freq(df_preprocess_abstract)
et.plot_distrib(df_wc_abstract, "abstract");


The word count plot above shows a richer vocabulary that is not always specific to the medical research literature at hand, e.g. also, may, can, however, not, abstract... If we were handling the corpus with a bag-of-words approach, these words should be removed. This plot has been insightful in helping us build our strategy for the studies that follow.

**On the text**

In [None]:
df_preprocess_text = df_meta_comp["text"].apply(
        lambda x: et.nlp_preprocess_text(x, config.PUNCT_DICT)
)
df_wc_text = et.corpora_freq(df_preprocess_text)
et.plot_distrib(df_wc_text, "text");

While the abstract already contains some uninformative words, this type of vocabulary is even more present in the full text.

## On the affiliation

In [None]:
df_process_affiliation = df_meta_comp["affiliations"].apply(et.process_affiliations)

df_wc_affiliation = et.corpora_freq(df_process_affiliation, affiliation=True)
ax = et.plot_distrib(df_wc_affiliation.iloc[1:], "affiliation")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);

The top 10 institutions that have contributed to this dataset are mostly Chinese. This makes sense, as China is where the first wave of COVID-19 took place. Overall, though, the contribution to this dataset is a global effort.

# Embedding
In this section, we embed our text data. The idea of embedding is to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The embedded data are also lower-dimensional than classical representations such as count or TF-IDF vectorization. Once we have the embeddings, we use them as latent variables for further processing, e.g. clustering or similarity computation with embedded queries. Here, we focus on these two tasks.

Now comes the question of which embedding to use. We started with a literature search on the most recent NLP embeddings, and BERT ([BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)), which builds on the Transformer architecture of [Attention Is All You Need](https://arxiv.org/abs/1706.03762), came to mind. Further research led us to SciBERT ([SciBERT: A Pretrained Language Model for Scientific Text](https://arxiv.org/abs/1903.10676)) and BioBERT. We therefore compared BERT, SciBERT and BioBERT embeddings and obtained more relevant results with BioBERT (results not included in this notebook). Here, we compare using only the title against using the title together with the abstract, hence testing the sensitivity of BioBERT embeddings to the length of the text.

## BERT in a nutshell
There are some key elements a reader ought to know when using BERT-like embeddings.
- BERT has a vocabulary of about 30k words/subwords (WordPiece embeddings); an input is mapped onto these words/subwords.
- Little or no preprocessing is required, as the BERT tokenizer maps words onto the vocabulary itself. Common words such as "the", and slightly less common ones such as "quantum" or "constantinople", are present in the BERT vocabulary and are mapped directly. The word "electrodynamics", however, is absent, so it is broken down into 4 subwords, electro ##dy ##nami ##cs, which are all present in the vocabulary.
- Some special tokens are mandatory: [CLS] (for "classification") is placed at the beginning of the input, and [SEP] (for "separator") marks the boundary between sentences, as needed by the next sentence prediction task. The underlying model is an attention-based bidirectional Transformer encoder.
- The output embeddings have shape (x, 768), where x is the number of tokens, which we pad or truncate to a maximum length chosen according to the corpus.
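
To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first WordPiece tokenization. The vocabulary is a tiny made-up set (the real one has about 30k entries), and `wordpiece` and `encode` are hypothetical helpers, not the Hugging Face tokenizer.

```python
# Toy vocabulary; continuation pieces carry the "##" prefix.
VOCAB = {"[CLS]", "[SEP]", "[PAD]", "[UNK]", "the", "quantum",
         "electro", "##dy", "##nami", "##cs"}

def wordpiece(word, vocab=VOCAB):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def encode(text, max_len=10):
    """Add [CLS]/[SEP], truncate to max_len and pad with [PAD]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    tokens = tokens[: max_len - 1] + ["[SEP]"]
    return tokens + ["[PAD]"] * (max_len - len(tokens))

tokens = encode("the quantum electrodynamics")
print(tokens)
```

The padded token list has a fixed length, which is why the model output has one 768-dimensional vector per position up to the chosen maximum length.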

## Load BioBERT model
The BioBERT model can be obtained from this [GitHub](https://github.com/dmis-lab/biobert) page. We use the transformers library from Hugging Face to handle the model because of its simplicity: with just a few lines of code, we have the embedding module up and running.

⚠️ **We previously ran into RAM issues when generating embeddings. This part of the code therefore computes them in batches so as not to overload the RAM.**
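
The batching itself happens inside `be.embed_loop`, which is not shown. The sketch below only illustrates the general pattern, under two assumptions: texts are encoded one small batch at a time, and token vectors are averaged into one vector per text while ignoring padding. `fake_encoder` is a hypothetical stand-in for the BERT forward pass.

```python
import numpy as np

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def masked_mean(token_vectors, mask):
    """Average token vectors over real tokens only (mask == 1)."""
    mask = mask[..., None].astype(float)            # (batch, tokens, 1)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)   # avoid division by zero
    return summed / counts

def fake_encoder(texts, max_len=6, dim=4):
    """Hypothetical encoder: token at position p gets a vector filled with p + 1."""
    vectors = np.zeros((len(texts), max_len, dim))
    mask = np.zeros((len(texts), max_len), dtype=int)
    for row, text in enumerate(texts):
        n_tokens = min(len(text.split()), max_len)
        for pos in range(n_tokens):
            vectors[row, pos, :] = pos + 1
        mask[row, :n_tokens] = 1
    return vectors, mask

def embed_in_batches(texts, batch_size=2):
    """Only one batch of token vectors is held in memory at a time."""
    chunks = []
    for batch in iter_batches(texts, batch_size):
        vectors, mask = fake_encoder(batch)
        chunks.append(masked_mean(vectors, mask))
    return np.vstack(chunks)

embeddings = embed_in_batches(["a b c", "a", "a b", "a b c d e", ""])
print(embeddings.shape)
```

Only the pooled vectors are kept between batches, so peak memory depends on the batch size rather than on the size of the corpus.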

In [None]:
from transformers import BertTokenizer, BertModel

model_version =  os.path.join(ROOTDIR, "biobert/biobert")
do_lower_case = True
model = BertModel.from_pretrained(model_version)
tokenizer = BertTokenizer.from_pretrained(
    model_version, do_lower_case=do_lower_case
)

# define some parameters
nsamples = 2000
nsize = 2000

Since this work is just a first look at our [website](https://covid19.ai2prod.com/), we only sample a handful of articles. Feel free to click on the link to test our search engine.

## BioBERT on title


In [None]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
title_len = df_meta_comp["title"].apply(lambda x: len(x.split()) if not pd.isnull(x) else np.nan)
mean_title_len = np.nanmean(title_len)
print("Mean title length: %.1f words" %mean_title_len)

In [None]:
df_title = df_meta_comp["title"].copy()

In [None]:
print("Processing:%d entries" % nsize)

In [None]:
# you can reshuffle
#shuffle_df_title = df_title#.sample(frac=1)

start = timer()
df_embedding_title = be.embed_loop(df_title, nsize, model, tokenizer, config.STOPWORDS, max_len = 48);
dt = timer() - start


In [None]:
print("\n\nCalculation done in %f s" % dt)


Embedding the titles is pretty straightforward, as a title contains about 12.5 words on average.

### Clustering
With the embeddings done, we can proceed with clustering. We use mini-batch k-means to cluster our datapoints, which again fits our aim of performing all computations in batches. We then perform a PCA projection into 3D.
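
The `cc.miniBatchClustering` helper is not shown in this kernel. To give an idea of what a mini-batch k-means update does, here is a small NumPy sketch under stated assumptions: centers are initialised from random datapoints and each center moves toward its assigned points with a learning rate of 1/count. It is an illustration, not the implementation used here.

```python
import numpy as np

def mini_batch_kmeans(X, k, batch_size=32, n_iter=200, seed=0):
    """Cluster X into k groups while looking at one small batch per step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign every point of the batch to its nearest center.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Move each center toward its points; the step shrinks as counts grow.
        for point, j in zip(batch, nearest):
            counts[j] += 1
            centers[j] += (point - centers[j]) / counts[j]
    # Final assignment of all datapoints to the learned centers.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)

# Two made-up blobs standing in for the embeddings.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(10.0, 0.5, (100, 2))])
centers, labels = mini_batch_kmeans(X, k=2)
print(centers.shape, labels.shape)
```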

In [None]:
n_clusters = np.arange(10, 38, 2)
subtext = "_title"
df_clusters = cc.miniBatchClustering(df_embedding_title, None, n_clusters)

To determine the optimal number of clusters, we use the Silhouette score, which measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It takes values between -1 and 1: values near -1 mean the datapoint does not belong to its attributed cluster at all, and values near 1 indicate good clustering.
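
For a datapoint with mean distance a to the other members of its own cluster and smallest mean distance b to any other cluster, the Silhouette value is (b - a) / max(a, b). A minimal NumPy sketch on made-up 2D points follows; `silhouette_values` is a hypothetical helper, not the function used by `cc.plot_silhouette_graph`.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point Silhouette values using Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # Cohesion: mean distance to the other members of the same cluster
        # (the zero distance of the point to itself is excluded by n_same - 1).
        a = dists[i, same].sum() / (n_same - 1)
        # Separation: smallest mean distance to any other cluster.
        other_means = [dists[i, labels == other].mean()
                       for other in np.unique(labels) if other != labels[i]]
        b = min(other_means)
        scores[i] = (b - a) / max(a, b)
    return scores

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_values(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette_values(points, [0, 1, 0, 1])   # labels cut across the groups
print(good.mean(), bad.mean())
```

With labels that follow the two groups the mean value is close to 1, while labels that cut across the groups give a negative mean.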

In [None]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(3, 3, figsize=(12, 16))

cc.plot_silhouette_graph(
        df_embedding_title,
        df_clusters,
        n_clusters,
        fig,
        axes,
    )

The average Silhouette score is rather low, but this score should not be the hard and fast rule for determining the optimal number of clusters. Looking closely at the Silhouette plots, we observe that many datapoints have negative scores, suggesting they are attributed to the wrong cluster. However, these scores lie between -0.1 and 0, so the attributions are not that bad. On the other hand, the positive Silhouette scores are not that high either.

For a low number of clusters (4-8), we observe that the cluster sizes are rather uneven, and the larger clusters could potentially be divided into sub-clusters. Hence, a higher number of clusters is preferred.

The fact that many datapoints are attributed to the wrong cluster might indicate that many articles are cross-disciplinary and therefore do not belong to only one cluster.

This analysis shows that the Silhouette graph is not very conclusive. We therefore turn to another metric, the Calinski-Harabasz score, defined as the ratio of the between-cluster dispersion to the within-cluster dispersion; a higher score indicates better-separated clusters.
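
As an illustration, the Calinski-Harabasz score can be written as the between-cluster dispersion divided by (k - 1), over the within-cluster dispersion divided by (n - k), for n points and k clusters. The NumPy sketch below uses made-up points; `calinski_harabasz` is a hypothetical helper, not `cc.evaluate_metric_score`.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion, each per degree of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, k = len(X), len(clusters)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        # Weighted squared distance of the cluster center to the global mean.
        between += len(members) * np.sum((center - overall_mean) ** 2)
        # Squared distances of the members to their own center.
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = calinski_harabasz(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = calinski_harabasz(points, [0, 1, 0, 1])   # clusters that overlap heavily
print(good, bad)
```

On this toy example the well-separated labelling scores 200 and the overlapping one 0.02, which shows why higher values are preferred.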


In [None]:
df_calinski = cc.evaluate_metric_score(df_embedding_title, df_clusters, metric="calinski")
df_calinski["n_cluster"] = n_clusters

In [None]:
ax = cc.plot_metric_vs_cluster(df_calinski)
x = df_calinski['n_cluster'].values
y = df_calinski['metric'].values

a1 = (y[5]-y[0])/(x[5]-x[0])
b1 = 1100

def f(x, a, b):
    return a*x+b

x_asympt= np.arange(10., 32.)
y1 = f(x_asympt, a1, b1)

ax.axhline(350,  color="black", linestyle="dashed", linewidth=1)
ax.axvline(25, ymax=.6, color="red", linestyle="dashed", linewidth=1)

ax.plot(x_asympt, y1, color="black", linestyle="dashed", linewidth=1)
ax.set_ylabel("Calinski-Harabasz score");


By plotting the Calinski-Harabasz score against the number of clusters, we observe an elbow. Although this is not a very precise method for determining the optimal number of clusters, it at least gives us some indication. The optimal number of clusters determined here is 30.


### Visualization
Another way to check whether our datapoints are correctly clustered is to visualize them in lower dimensions. We perform a PCA with 3 components to obtain a projection in 3D. Before carrying out the PCA projection, we sphereize our datapoints (the embeddings) by shifting each datapoint by the centroid and scaling it to unit norm. This distributes the datapoints more evenly over the sphere.
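
A minimal NumPy sketch of these two steps is given below, assuming that sphereizing means centering on the centroid and scaling each row to unit norm, and computing the PCA through an SVD. The functions are hypothetical re-implementations for illustration, not the `vp` helpers, and the random matrix stands in for the embeddings.

```python
import numpy as np

def sphereize(X):
    """Shift the data by its centroid and scale every row to unit norm."""
    centered = X - X.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.clip(norms, 1e-12, None)  # guard against zero rows

def pca_project(X, n_components=3):
    """Project X onto its first principal directions using an SVD."""
    centered = X - X.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 16))  # stand-in for the (n, 768) embeddings
unit = sphereize(embeddings)
projected = pca_project(unit)
print(projected.shape)
```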

**Give the 3D plot a spin, and hover over the datapoints to get some information.**

In [None]:

X = vp.sphereize(df_embedding_title.values)
index = df_embedding_title.index
colors = px.colors.qualitative.Alphabet + px.colors.qualitative.Plotly
df_pca = vp.compute_pca(X, index)
df_resampled = vp.resample(df_pca, df_meta_comp, df_clusters, colors, nsamples=nsamples, n_cluster=24)
info_title = vp.prepare_info(df_resampled)
vp.plot_tsne(df_resampled, "X_0", "X_1", info_title, var_z="X_2")


We observe that the clustering is rather coherent. To evaluate the relevance of the clusters, our next step is to consult domain experts.

### Wordcloud
We further plot a word cloud for each cluster to help with interpretation. Since the BioBERT vocabulary consists of wordpieces, it does not make much sense to use them for our word clouds. We therefore proceed with a conventional NLP preprocessing pipeline: lemmatization and tokenization, stopword removal, punctuation removal, and count vectorization.

We first preprocess the whole corpus to identify the top words and then treat them as stopwords.
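
The idea of turning the most frequent corpus words into extra stopwords can be sketched with `collections.Counter`. The titles below are made up and only one top word is removed for brevity; `wcp.prepare_word_cloud` itself is not shown and may work differently.

```python
from collections import Counter

# Made-up preprocessed titles, grouped by cluster label.
titles_by_cluster = {
    0: ["virus detection assay", "rapid virus detection"],
    1: ["virus vaccine trial", "vaccine response virus"],
}

# Word counts over the whole corpus.
corpus_counts = Counter(
    word
    for titles in titles_by_cluster.values()
    for title in titles
    for word in title.split()
)

# The most frequent corpus words carry little cluster-specific information.
extra_stopwords = {word for word, _ in corpus_counts.most_common(1)}

# Per-cluster counts once those words are filtered out.
cluster_counts = {
    label: Counter(
        word
        for title in titles
        for word in title.split()
        if word not in extra_stopwords
    )
    for label, titles in titles_by_cluster.items()
}
print(extra_stopwords, cluster_counts[0].most_common(1), cluster_counts[1].most_common(1))
```

Without the filter, "virus" would top every cluster; with it, the dominant word of each cluster becomes cluster-specific.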

In [None]:
df_title_cluster = df_clusters.merge(
        df_title, left_index=True, right_index=True
    )

In [None]:
df_title_cluster["title_process"] = df_title_cluster["title"].apply(
        lambda x: et.nlp_preprocess(str(x), config.PUNCT_DICT)
        if not pd.isnull(x) else "")

df_title_wc = wcp.prepare_word_cloud(df_title_cluster["title_process"])
# get rid of top 10 words
extra_stopwords = df_title_wc.head(10).keys().tolist()

extra_stopwords += ["covid", "sars", "infectious", "19", "volume", "index", "chapter", "volume", "1"]

Now we loop over all clusters and preprocess the text to build the vocab for wordcloud representation.


In [None]:
wc_title = defaultdict(int)
ncluster = 24
nclusters = sorted(
        df_title_cluster["labels_cluster_%d" % ncluster].unique().tolist()
    )

for k in nclusters:
    temp = df_title_cluster[df_title_cluster["labels_cluster_%d" % ncluster] == k][
        "title_process"
    ]

    try:
        wc_title[k] = wcp.prepare_word_cloud(temp, extra_stopwords)
    except ValueError:
        pass

n_top = 50
fig, axes = plt.subplots(6, 4, figsize=(24, 18))
ax = axes.flatten()
wcp.plot_word_cloud(wc_title, ax)
plt.tight_layout(w_pad=4.0)

Above each word cloud are its top 3 words by occurrence. The topics are quite distinctive, considering that we have only used the title as input. Among the topics, we have syndromes, detection, genetics (RNA), control and diagnosis, surveillance...


## BioBERT on title and abstract
For this task, we concatenate the title and the abstract. Since BERT accepts at most 512 tokens, lengthy texts are truncated. The procedure resembles that used for the title embeddings.
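
A hedged sketch of the concatenation and truncation is shown below. `join_fields` and `truncate` are hypothetical helpers (the kernel's `be.join_title_abstract` is not shown), and whitespace-separated words stand in for wordpiece tokens.

```python
def join_fields(title, abstract):
    """Join title and abstract, skipping missing (NaN/None) or empty parts."""
    parts = [p.strip() for p in (title, abstract) if isinstance(p, str) and p.strip()]
    return " ".join(parts)

def truncate(tokens, max_len=512):
    """Keep at most max_len tokens, reserving two slots for [CLS] and [SEP]."""
    return ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]

joined = join_fields("Title", "Some abstract.")
title_only = join_fields("Title", float("nan"))  # a missing abstract is skipped
long_input = truncate(["w"] * 1000)              # cut down to 512 tokens
print(joined, "|", title_only, "|", len(long_input))
```

Because the title comes first, it is the end of a long abstract that gets cut off by the truncation.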

In [None]:
df_title_abstract = df_meta_comp[["title", "abstract"]].apply(
    be.join_title_abstract, axis=1
)

In [None]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
title_abstract_len = df_title_abstract.apply(lambda x: len(x.split()) if not pd.isnull(x) else np.nan)
mean_title_abstract_len = np.nanmean(title_abstract_len)
print("Mean title abstract length: %.1f words" %mean_title_abstract_len)

In [None]:
print("Processing:%d entries" % nsize)
#shuffle_df_title_abstract = df_title_abstract.sample(frac=1)

In [None]:
start = timer()
df_embedding_title_abstract = be.embed_loop(df_title_abstract, nsize, model, tokenizer, config.STOPWORDS, max_len = 256);
dt = timer() - start


In [None]:
print("\n\nCalculation done in %f s" % dt)

In [None]:
n_clusters = np.arange(4, 30, 2)
df_clusters_title_abstract = cc.miniBatchClustering(df_embedding_title_abstract, None, n_clusters)

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(12, 16))

cc.plot_silhouette_graph(
        df_embedding_title_abstract,
        df_clusters_title_abstract,
        n_clusters[-10:],
        fig,
        axes,
    )

These Silhouette plots look a little different from those obtained with the title only: the values are higher and the errors in cluster attribution seem to be fewer. Let's have a look at the Calinski-Harabasz score.

In [None]:
df_calinski_title_abstract = cc.evaluate_metric_score(df_embedding_title_abstract, df_clusters_title_abstract, metric="calinski")
df_calinski_title_abstract["n_cluster"] = n_clusters

In [None]:

ax = cc.plot_metric_vs_cluster(df_calinski_title_abstract)
x = df_calinski_title_abstract['n_cluster'].values
y = df_calinski_title_abstract['metric'].values

a1 = (y[4]-y[3])/(x[4]-x[3])
b1 = 1350

def f(x, a, b):
    return a*x+b

x_asympt= np.arange(5., 26.)
y1 = f(x_asympt, a1, b1)

ax.axhline(400,  color="black", linestyle="dashed", linewidth=1)
ax.axvline(21, ymax=.5, color="red", linestyle="dashed", linewidth=1)
ax.plot(x_asympt, y1, color="black", linestyle="dashed", linewidth=1)

ax.set_ylabel("Calinski-Harabasz score");

### Visualization 
Projection with PCA.

In [None]:
X_title_abstract = vp.sphereize(df_embedding_title_abstract.values)
index_title_abstract = df_embedding_title_abstract.index
colors = px.colors.qualitative.Alphabet
df_pca_title_abstract = vp.compute_pca(X_title_abstract, index_title_abstract)
df_resampled_title_abstract = vp.resample(df_pca_title_abstract, df_meta_comp, 
                                          df_clusters_title_abstract, colors, nsamples=nsamples )
info_title_abstract = vp.prepare_info(df_resampled_title_abstract)
vp.plot_tsne(df_resampled_title_abstract, "X_0", "X_1", info_title_abstract, var_z="X_2")

### Wordcloud

In [None]:
df_title_abstract.name = "title_abstract"
df_title_abstract_cluster = df_clusters_title_abstract.merge(
        df_title_abstract, left_index=True, right_index=True
    )
df_title_abstract_cluster["title_abstract_process"] = df_title_abstract_cluster["title_abstract"].apply(
        lambda x: et.nlp_preprocess(str(x), config.PUNCT_DICT)
        if not pd.isnull(x) else "")

In [None]:
df_title_abstract_wc = wcp.prepare_word_cloud(df_title_abstract_cluster["title_abstract_process"])
# since there are more words, we consider the top 20 words as stop words
extra_stopwords_title_abstract = df_title_abstract_wc.head(20).keys().tolist()

extra_stopwords_title_abstract += ["covid", "sars", "infectious", "19", "may", "can", "volume", "index", "chapter", "volume", "used", "also"]

In [None]:
wc_title_abstract = defaultdict(int)
ncluster = 20
nclusters = sorted(
        df_title_abstract_cluster["labels_cluster_%d" % ncluster].unique().tolist()
    )

for k in nclusters:
    temp = df_title_abstract_cluster[df_title_abstract_cluster["labels_cluster_%d" % ncluster] == k][
        "title_abstract_process"
    ]

    try:
        wc_title_abstract[k] = wcp.prepare_word_cloud(temp, extra_stopwords_title_abstract)
    except ValueError:
        pass

n_top = 50
fig, axes = plt.subplots(5, 4, figsize=(24, 18))
ax = axes.flatten()
wcp.plot_word_cloud(wc_title_abstract, ax)
plt.tight_layout(w_pad=4.0)

Compared to the previous study using only the title, the vocabulary is richer. The topics are also very distinctive: we observe topics related to treatment and vaccines, genes, the outbreak in China, organization and research...

# Demo
To get an idea of the difference between the two studies, we have prepared a demo built with ipywidgets.

In [None]:
import ipywidgets as widgets
from ipywidgets import interact, Layout

In [None]:
text_area_layout = Layout(width="70%", height="50px")
text_area = widgets.Textarea(value="Range of incubation periods for the disease in humans and how long individuals are contagious, even after recovery.", placeholder="Enter your text here.", layout=text_area_layout)

int_slider_layout = Layout(width="50%")
int_slider = widgets.IntSlider(description="Select number of results to show",
                               min=1, 
                               max=40, 
                               value=10, 
                               layout=int_slider_layout,
                               style={'description_width': 'initial'}
                              )

radio_buttons_layout = Layout(width="50%")
radio_buttons = widgets.RadioButtons(description="select embeddings", 
                                     value='title', 
                                     options=['title', 'title+abstract'],
                                     style={'description_width': 'initial'},
                                     layout=radio_buttons_layout
                                    )

toggle_button = widgets.ToggleButton(value=True)

checkbox = widgets.Checkbox(value=False, description='Show abstracts', disabled=False, indent=False)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
@interact
def plot_search_results(emb=radio_buttons, n=int_slider, show_abstracts=checkbox, text=text_area):
    # text.split() returns a list and never equals "", so test for a non-blank query instead
    if text.strip():
        if emb == "title": 
            embs = df_embedding_title
            max_len = 48
        elif emb == "title+abstract":
            embs = df_embedding_title_abstract
            max_len = 256
        
        print(f"Displaying {n} most similar results for \n{text} ...\n")
        
        tokenized_text = be.custom_tokenize(text, config.STOPWORDS, tokenizer, spacytokenizer)
        embedding_text = be.embed_text(tokenized_text, model, tokenizer, max_len=max_len).mean(1).detach().numpy()
        #row_ids, ordered_dict = be.top_n_closest(embedding_text, embs, df_meta_comp)
        similarities = cosine_similarity(embs.values, embedding_text).reshape(1,-1)
        indices = np.argsort(similarities)[0]
        indices = indices[::-1][:n]
        row_ids = embs.iloc[indices].index
      
        for i, (row_id, index) in enumerate(zip(row_ids, indices)):

            title = df_meta_comp.loc[row_id]['title']
            abstract = df_meta_comp.loc[row_id]['abstract']
            print(f'result {i} title : {title}')
            print(f'similarity : {similarities[0][index]}')
            
            if show_abstracts:
                print('')
                print(f'result {i} abstract : {abstract}')

            print('----' )
    else:
        print('no query, no results baby.')

Testing with a few queries shows that the cosine similarities are very high in both cases. From the article [Semantic Similarity in Sentences and BERT](https://medium.com/analytics-vidhya/semantic-similarity-in-sentences-and-bert-e8d34f5a4677), we understand that BERT, unlike the Universal Sentence Encoder or InferSent models, is not trained directly for semantic sentence similarity. Cosine similarity on raw BERT embeddings is therefore not a reliable measure of sentence similarity. Addressing this is our next step.

Moreover, when the query is short and the title+abstract embeddings are used, most of the returned results have no abstract, which suggests that the embeddings are sensitive to text length.
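
For reference, the ranking step of the demo can be sketched in plain NumPy: normalise the query and the document vectors, take dot products, and keep the n largest. The `top_n_cosine` helper and the 2D vectors are illustrative assumptions; the demo itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def top_n_cosine(query, matrix, n=3):
    """Return the indices and cosine similarities of the n closest rows."""
    query = query / np.linalg.norm(query)
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = rows @ query                      # cosine similarity per row
    order = np.argsort(sims)[::-1][:n]       # indices by decreasing similarity
    return order, sims[order]

# Made-up 2D "embeddings" for four documents and one query.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
order, sims = top_n_cosine(np.array([1.0, 0.1]), docs, n=2)
print(order, sims)
```

Cosine similarity only compares directions, so it says nothing about whether the embedding space was trained to make those directions meaningful for sentence similarity.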

# Conclusions
In this kernel, we have used BioBERT to embed our text data and used these embeddings to compute the similarity with a user's query. We have clustered the datapoints and used PCA to reduce the number of dimensions and visualize them. With word clouds, we have been able to get an idea of the topic of each cluster. Finally, we have built a demo to test two settings: one embedding the title only, the other embedding both the title and the abstract. The returned results seem to be a little bit off.

# Next steps

Using BioBERT embeddings directly to measure semantic similarity has shown some caveats: it is very sensitive to length, and the returned results are not necessarily very relevant. We have since started exploring the Universal Sentence Encoder to compute our embeddings, and it has shown some promising results. Stay tuned for more 😀.

# References
* [scibert-embeddings](https://www.kaggle.com/isaacmg/scibert-embeddings)
* [Ipywidgets](https://ipywidgets.readthedocs.io/en/stable/)
* [BioBERT](https://github.com/dmis-lab/biobert)
* [SciBERT](https://github.com/allenai/scibert)
* [Hugging Face Transformers](https://github.com/huggingface/transformers)