In [None]:
# Install required pip packages. Please note that the internet must be switched on to install them in the Kaggle kernel.
!pip install -U kneed

In [None]:
# Imports
import os
from os.path import join as join_path
import numpy as np
rng_seed = 368
np.random.seed(rng_seed)
import pandas as pd

# Silence NumbaPerformanceWarning for UMAP
import warnings
from numba.errors import NumbaPerformanceWarning
warnings.filterwarnings("ignore", category=NumbaPerformanceWarning)
import umap
from sklearn.cluster import KMeans

from kneed import KneeLocator
from scipy.spatial.distance import pdist
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
import plotly.express as px

from IPython.display import IFrame

# Coordle - Search Engine using Word2Vec and TF-IDF
Due to the special circumstances of the COVID-19 pandemic, the students of the Selected Topics in Machine Learning (topic being "Deep Learning") course ([INF368 Spring 2020](https://www.uib.no/en/course/INF368?sem=2020v)) at the University of Bergen were asked to participate in the CORD-19 research challenge on Kaggle.

In this notebook, you will find a search engine for the articles in the CORD-19 dataset. We named it Coordle (from Google + CORD) and built it using TF-IDF and Word2Vec. The search engine can be found in the interactive cell below, or at https://coordle.triki.no/.

In [None]:
%%html
<center><iframe src="https://coordle.triki.no/" width="800" height="600" frameborder="0" allowfullscreen></iframe></center>

### Table of contents
1. Installing the Coordle library
2. Data preprocessing
3. Creating word embeddings from scratch using Gensim
4. Visualize word embeddings using K-means clustering and UMAP
5. Creating a search engine using TF-IDF
6. Task results
7. Future work

## 1. Installing the Coordle library
We have separated the code into two GitHub repositories. [The first one](https://github.com/JonasTriki/inf368-exercise-3-cord-19) is used for data preprocessing and experimentation. [The second repository](https://github.com/JonasTriki/inf368-exercise-3-coordle) is where the Coordle library is maintained, and it is the one we use throughout the notebook. Please note that the internet must be switched on to run the cell below, since it installs the Coordle library.

In [None]:
!pip install -U https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz
!pip install -U git+https://github.com/JonasTriki/inf368-exercise-3-coordle.git
    
# Import Coordle modules
from coordle.preprocessing import CORD19Data
from coordle.utils import clean_text
from coordle.backend import QueryAppenderIndex

## 2. Data preprocessing
To load and preprocess the CORD-19 data, we take inspiration from [Daniel Wolffram's "CORD-19: Create Dataframe" Notebook](https://www.kaggle.com/danielwolffram/cord-19-create-dataframe) and the ["Date updates thread"](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/137474) from the challenge itself. The goal of the data preprocessing is to get a single .csv file with all the cleaned/parsed data in place.

In particular, we first load the `metadata.csv` file using Pandas and perform some cleaning on it (dropping duplicates and articles without metadata). Then, we go through every row of the .csv file and parse it. We ensure that each row has either a PDF or a PMC parse, and we prefer the PMC parse when both exist. For the body text of each article, we remove cite spans from the text since they are useless for creating word embeddings. We observed some false-positive articles with fewer than roughly 1000 characters of body text, and we exclude these.
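
The cite-span removal and the length filter can be sketched as follows. This is a minimal illustration and not the code from the Coordle library: it assumes each parsed paragraph is a dictionary with a `text` field and a `cite_spans` list holding `start`/`end` character offsets, as in the CORD-19 JSON parses. The helper names `remove_cite_spans` and `build_body_text` are hypothetical.

```python
from typing import Dict, List, Optional


def remove_cite_spans(paragraph: Dict) -> str:
    """Return the paragraph text with its citation spans cut out."""
    text = paragraph["text"]
    # Remove spans from right to left so the earlier offsets stay valid.
    spans = sorted(
        paragraph.get("cite_spans", []),
        key=lambda span: span["start"],
        reverse=True,
    )
    for span in spans:
        text = text[: span["start"]] + text[span["end"]:]
    return text


def build_body_text(paragraphs: List[Dict], min_chars: int = 1000) -> Optional[str]:
    """Join the cleaned paragraphs; return None if the result is too short."""
    body = "\n".join(remove_cite_spans(p) for p in paragraphs)
    return body if len(body) >= min_chars else None


# Hypothetical paragraph in the assumed format.
example = {
    "text": "Coronaviruses infect many species [1]. Bats are a known reservoir [2, 3].",
    "cite_spans": [
        {"start": 34, "end": 37, "text": "[1]"},
        {"start": 66, "end": 72, "text": "[2, 3]"},
    ],
}
print(remove_cite_spans(example))
```

Removing the spans from the end of the text backwards avoids having to recompute offsets after each deletion.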

Next, we remove duplicate articles that have the same abstract/body_text and detect the language of each article using spaCy. We do this because we only want English articles in our final dataframe. After this, we save the result using Pandas. For more details about how we preprocessed the data, please consult the [cord_19_data.py](https://github.com/JonasTriki/inf368-exercise-3-coordle/blob/master/coordle/preprocessing/cord_19_data.py) file from the Coordle library repository.
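
As a rough sketch of the deduplication and language filtering step, the snippet below works on a tiny made-up dataframe. It assumes that duplicates are rows sharing both `abstract` and `body_text`, and that a `language` column has already been filled in (the language detection itself is not shown). The actual logic in `cord_19_data.py` may differ.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "cord_uid": ["a1", "b2", "c3", "d4"],
    "abstract": ["Abstract one", "Abstract one", "Abstract two", "Resumen"],
    "body_text": ["Same body", "Same body", "Another body", "Otro texto"],
    "language": ["en", "en", "en", "es"],
})

# Keep the first article of every (abstract, body_text) pair.
deduped = df.drop_duplicates(subset=["abstract", "body_text"], keep="first")

# Keep only the articles detected as English.
english_only = deduped[deduped["language"] == "en"].reset_index(drop=True)
print(english_only)
```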

In [None]:
# Define some constants
kaggle_input_dir = join_path('/', 'kaggle', 'input')
cord_data_raw_dir = join_path(kaggle_input_dir, 'CORD-19-research-challenge')

In [None]:
# Perform preprocessing on the raw data
cord_df = CORD19Data(cord_data_raw_dir).process_data()

In [None]:
#  Sanity check the processed dataframe
cord_df.head()

## 3. Creating word embeddings from scratch using Gensim
To create the word embeddings for the CORD-19 dataset, we used Gensim and the `Word2Vec` class. However, before we can train the model, we first define some helper classes. In particular, we define a data iterator that yields sentences for Word2Vec to train on and a callback that saves an intermediate model after each epoch. The data iterator uses the `clean_text` function from the Coordle library. In short, it cleans the text by lowercasing it and removing punctuation, stopwords, numerics and one-character words. Finally, it lemmatizes the text (turning the word `viruses` into `virus`, for instance). The code for the function can be found [by clicking here](https://github.com/JonasTriki/inf368-exercise-3-coordle/blob/master/coordle/utils/utils.py#L40).
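
To make the cleaning steps concrete, here is a simplified stand-in for `clean_text`. It is not the Coordle implementation: the stopword list is a tiny hypothetical one, the function is assumed to return a list of tokens, and lemmatization is left out because it needs a language model such as the spaCy one installed earlier.

```python
import string
from typing import List

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to", "for"}

# Map every punctuation character to a space, so "covid-19" becomes "covid 19".
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_clean_text(text: str) -> List[str]:
    """Lowercase, strip punctuation, and drop stopwords, numbers and 1-char tokens."""
    tokens = text.lower().translate(PUNCT_TO_SPACE).split()
    return [
        token for token in tokens
        if token not in STOPWORDS and not token.isdigit() and len(token) > 1
    ]


print(simple_clean_text("The 2 viruses are found in bats, and COVID-19 is a concern."))
```

Note that without the lemmatization step, `viruses` stays in its plural form here.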

In [None]:
# Implement the data iterator for Word2Vec
import nltk  # sentence splitting; sent_tokenize needs the NLTK "punkt" data

class CORDDataIteratorWord2Vec():
    def __init__(self, texts: np.ndarray):
        self.texts = texts
    
    def __iter__(self):
        for text in self.texts:
            # Split each article into sentences and clean them one by one
            sentences = nltk.tokenize.sent_tokenize(text)
            cleaned_sentences = [clean_text(sent) for sent in sentences]
            for sentence in cleaned_sentences:
                yield sentence

In [None]:
# Implement the epoch saver for Word2Vec
class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch.'''

    def __init__(self, output_dir: str, prefix: str, start_epoch: int = 1):
        self.output_dir = output_dir
        self.prefix = prefix
        self.epoch = start_epoch

    def on_epoch_end(self, model):
        output_path = join_path(self.output_dir, f'{self.prefix}_epoch_{self.epoch}.model')
        model.save(output_path)
        self.epoch += 1

After we have defined these two classes, we train the model in three steps:
1. Initialize Word2Vec model
2. Build Word2Vec vocabulary
3. Train the model

We split the training into three separate steps so that we can sanity check each one and catch mistakes along the way.

This is illustrated in the code below, which takes around 10 hours to run. For your convenience, we have imported the final model/weights into the kernel in the `input/gensim-word2vec-model` folder after running for 20 epochs.
```python
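import nltk
from tqdm import tqdm

# Extract English only texts
cord_df_eng = cord_df[cord_df['language'] == 'en']
eng_texts = cord_df_eng['body_text'].values

cord_sentences = CORDDataIteratorWord2Vec(eng_texts)
# Number of sentences the iterator will yield (needed for the tqdm progress bar)
cord_num_sentences = sum(len(nltk.tokenize.sent_tokenize(text)) for text in eng_texts)

w2v_saved_models_dir = 'models-word2vec'
os.makedirs(w2v_saved_models_dir, exist_ok=True)  # EpochSaver writes its checkpoints here
saved_models_prefix = 'model'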

# 1. Setup initial model
w2v_model = Word2Vec(
    min_count=20,
    window=2,
    size=300,
    negative=5,
    callbacks=[EpochSaver(w2v_saved_models_dir, saved_models_prefix)]
)

# 2. Build vocabulary
w2v_model.build_vocab(tqdm(cord_sentences, total=cord_num_sentences), progress_per=int(cord_num_sentences / 100))

# 3. Train model
w2v_model.train(
    cord_sentences,
    total_examples=w2v_model.corpus_count,
    epochs=20,
    report_delay=30
)
```

In [None]:
# Load the trained Gensim model
model_path = join_path(kaggle_input_dir, 'gensim-word2vec-model', 'cord-19-w2v.model')
w2v_model = Word2Vec.load(model_path)
word_embedding_matrix = w2v_model.trainables.syn1neg

### Test word embeddings by finding most similar word vectors
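
Gensim's `most_similar` ranks words by the cosine similarity between their vectors. The sketch below reproduces that idea with NumPy on a made-up three-word vocabulary; the function name `toy_most_similar_words` and the toy vectors are assumptions for illustration and do not come from the trained model.

```python
import numpy as np


def toy_most_similar_words(word, vocab, vectors, topn=3):
    """Rank the other words in vocab by cosine similarity to the given word."""
    # Normalise every row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[vocab.index(word)]
    sims = unit @ query
    order = np.argsort(-sims)  # highest similarity first
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]


toy_vocab = ["virus", "pathogen", "banana"]
toy_vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
print(toy_most_similar_words("virus", toy_vocab, toy_vectors, topn=2))
```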

In [None]:
w2v_model.wv.most_similar('covid')

In [None]:
w2v_model.wv.most_similar('virus')

In [None]:
w2v_model.wv.most_similar('pandemic')

From the few examples above, we observe that the word embeddings do indeed make sense. Next, we visualize the embeddings to get a deeper understanding of them.

## 4. Visualize word embeddings using K-means clustering and UMAP
We decided to use K-means clustering and UMAP to cluster the word embeddings and reduce their dimensionality. This part mainly serves as a sanity check to see that the word embeddings we got from the Word2Vec algorithm actually make sense. To find the best number of clusters for K-means, we use the elbow method.

In [None]:
# Cluster
min_k = 2
ks = np.arange(min_k, 21)
errors = np.zeros(len(ks))
clusterings = np.zeros((len(ks), word_embedding_matrix.shape[0]))
for k in ks:
    print(f'Clustering using k={k}...')
    clusterer = KMeans(n_clusters=k, n_jobs=-1)
    pred_labels = clusterer.fit_predict(word_embedding_matrix)
    clusterings[k - min_k] = pred_labels
    errors[k - min_k] = clusterer.inertia_

In [None]:
# Show the elbow plot to determine the best k
kneedle = KneeLocator(ks, errors, S=1.0, curve='convex', direction='decreasing')
kneedle.plot_knee()

# Select best clustering
best_clustering = clusterings[kneedle.knee - min_k]

In [None]:
# Reduce dimensionality using UMAP (with default params)
word_embedding_3d = umap.UMAP(n_components=3).fit_transform(word_embedding_matrix)

In [None]:
# Visualize the words in 3D with Plotly
word_embedding_vis_df = pd.DataFrame({
    'x': word_embedding_3d[:, 0],
    'y': word_embedding_3d[:, 1],
    'z': word_embedding_3d[:, 2],
    'cluster_label': best_clustering,
    'word': w2v_model.wv.index2word
})
fig = px.scatter_3d(word_embedding_vis_df, x='x', y='y', z='z', color='cluster_label', hover_name='word')
fig.show()

By zooming into some of the smaller clusters, we observe that months such as february and august are clustered together. We also observe clusters of date-related words and of words that represent temperatures, as well as some outliers here and there, which is not unexpected.

## 5. Creating a search engine using TF-IDF
- Intro
- How we did it
- Explain set-operators AND, OR, NOT
- Combining with word embeddings
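
The outline above mentions TF-IDF and the set operators AND, OR and NOT. As a minimal sketch of how these pieces can fit together (not the Coordle implementation), the snippet below builds an inverted index over three made-up documents, maps the operators onto Python set operations, and ranks matches by a summed TF-IDF score. The weighting `tf * log(N / df)` is one common variant and is an assumption here.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Made-up documents for illustration.
docs = {
    "d1": "virus spreads between bats",
    "d2": "vaccine trial for virus",
    "d3": "bats sleep during the day",
}

# Inverted index: term -> {doc_id: term frequency}
index: Dict[str, Dict[str, int]] = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf


def tfidf(term: str, doc_id: str) -> float:
    """TF-IDF weight using the raw term frequency and log(N / df)."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    return postings[doc_id] * math.log(len(docs) / len(postings))


def matching(term: str) -> Set[str]:
    """Ids of all documents that contain the term."""
    return set(index.get(term, {}))


def rank(hits: Iterable[str], terms: List[str]) -> List[Tuple[str, float]]:
    """Sort the hits by summed TF-IDF score, highest first (ties by id)."""
    scores = {doc_id: sum(tfidf(t, doc_id) for t in terms) for doc_id in hits}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


and_hits = matching("virus") & matching("bats")  # virus AND bats
or_hits = matching("virus") | matching("bats")   # virus OR bats
not_hits = set(docs) - matching("virus")         # NOT virus

print(rank(or_hits, ["virus", "bats"]))
```

Because each operator reduces to a set operation on posting lists, more complex queries can be evaluated by combining the intermediate sets before ranking.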

In [None]:
# To demonstrate how the search engine works, we index on a subset of the documents in the CORD-19 dataframe.
ai_index = QueryAppenderIndex(w2v_model.wv.most_similar, n_similars=1)
ai_index.build_from_df(
    cord_df[:1000],
    'cord_uid',
    'title',
    'body_text', 
    verbose=True, 
    use_multiprocessing=True,
    workers=-1
)
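
The index above receives `w2v_model.wv.most_similar` and `n_similars=1`, which suggests that each query term is extended with its nearest neighbour in the embedding space. The sketch below shows one way such a query expansion could work. It is an assumption about the idea and not the `QueryAppenderIndex` code: `expand_query` is a hypothetical helper, and `toy_most_similar` is a stand-in with made-up neighbours and scores.

```python
from typing import Callable, List, Tuple


def toy_most_similar(word: str, topn: int = 10) -> List[Tuple[str, float]]:
    """Stand-in for w2v_model.wv.most_similar with made-up neighbours."""
    neighbours = {
        "virus": [("pathogen", 0.8), ("viral", 0.7)],
        "symptoms": [("signs", 0.75)],
    }
    return neighbours.get(word, [])[:topn]


def expand_query(
    terms: List[str],
    most_similar: Callable[..., List[Tuple[str, float]]],
    n_similars: int = 1,
) -> List[str]:
    """Append up to n_similars similar words after each query term."""
    expanded: List[str] = []
    for term in terms:
        if term not in expanded:
            expanded.append(term)
        for similar_word, _score in most_similar(term, topn=n_similars):
            # Skip words that are already part of the query.
            if similar_word not in terms and similar_word not in expanded:
                expanded.append(similar_word)
    return expanded


print(expand_query(["virus", "symptoms", "humans"], toy_most_similar))
```

With a real Gensim model, `most_similar` raises a `KeyError` for words outside the vocabulary, so a production version would need to handle that case.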

In [None]:
def search_and_show(query: str, max_results: int = 5, max_body_length: int = 500):
    '''Searches using the AI Index and shows the result
    
    Args:
        query: Search query
        max_results: Max results to show for each query
        max_body_length: Max number of body text characters to show for each result
    '''
    docs, scores, errmsgs = ai_index.search(query)
    if errmsgs:
        print('The following errors occurred:', errmsgs)
    elif len(docs) == 0:
        print('Sorry, no results found.')
    else:
        for doc, score in zip(docs[:max_results], scores[:max_results]):
            body_text = cord_df[cord_df.cord_uid == doc.uid].body_text.values[0]
            print(f'{doc.uid}  {str(doc.title)[:70]:<70}  {score:.4f}')
            print('---')
            # Show a truncated preview followed by a literal ellipsis
            print(f'{body_text[:max_body_length]} ...')
            print('---')

In [None]:
search_and_show('virus')

In [None]:
search_and_show('virus AND')

In [None]:
search_and_show('coronavirus symptoms in humans')

## 6. Task results
Due to computational limitations, we only indexed a small subset of the 35k+ articles in our dataset in this notebook. To show the task results, we instead performed the following queries on the live website, which has all articles indexed. For simplicity, we only show the top 3 results for each query.
1. What do we know about COVID-19 risk factors?
    - covid AND risk AND factors
2. What do we know about vaccines and therapeutics?
    - covid AND vaccines
    - covid AND therapy
3. What has been published about medical care?
    - covid AND medical care
4. What do we know about diagnostics and surveillance?
    - covid AND diagnostics
    - covid AND surveillance

### 6.1. What do we know about COVID-19 risk factors?
![](https://i.imgur.com/j56hQIT.png)

### 6.2. What do we know about vaccines and therapeutics?
![](https://i.imgur.com/nSzQdDT.png)
![](https://i.imgur.com/w1YQ8Yd.png)

### 6.3. What has been published about medical care?
![](https://i.imgur.com/upnLRTU.png)

### 6.4. What do we know about diagnostics and surveillance?
![](https://i.imgur.com/V8jrFc9.png)
![](https://i.imgur.com/vai3KW0.png)

## 7. Future work
- Database indexing instead of in-memory
- Highlighting of search query words
- More (/better?) suggestions using Doc2Vec
- cat AND dog AI Index