# Analysis of Covid-19 Risk factors by an opinion extraction strategy


### Intro

This notebook is the result of the collaborative work of a group of engineers at Atos/Bull Fr.

Our goal was to **overcome the problem of quickly finding different opinions** about a given subject. Reliable information can be hard to get quickly: many different points of view are represented in the media as well as in the scientific literature.

Instead of simply returning the sentences closest to the query, we chose to **extract the different opinions**, which can be shared by different groups of people working on a subject.

![Overview](https://raw.githubusercontent.com/MrMimic/covid-19-kaggle/master/images/kaggle_covid.png)

### How it works 

#### Database creation

All titles, abstracts and body texts of the dataset have been [inserted into an SQLite DB](https://github.com/MrMimic/covid-19-kaggle/blob/master/src/main/python/c19/database_utilities.py#L186) (only English articles for the moment).

They have been preprocessed using the [method we developed](https://github.com/MrMimic/covid-19-kaggle/blob/master/src/main/python/c19/text_preprocessing.py#L21). It lowercases and stems the text, removes stopwords, removes numeric values, and splits texts into sentences and sentences into words.
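The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, not the library's code: the stopword list, the regular expressions and the `pre_process` name are assumptions, and stemming is left out (the notebook itself runs with `stem_words=False`).

```python
import re

# Tiny illustrative stopword list (an assumption; the library uses its own).
STOPWORDS = {"a", "an", "and", "as", "been", "for", "has", "is", "of", "or",
             "the", "there", "to"}


def pre_process(text, remove_numeric=True):
    """Lowercase a text, split it into sentences, then into filtered words."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    processed = []
    for sentence in sentences:
        # Keep alphanumeric tokens, allowing inner hyphens (e.g. "covid-19").
        words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
        words = [word for word in words if word not in STOPWORDS]
        if remove_numeric:
            # Drop tokens made only of digits.
            words = [word for word in words if not word.isdigit()]
        if words:
            processed.append(words)
    return processed


example = ("Chloroquine has been suggested as a treatment. "
           "There is no vaccine for the 2019 coronavirus.")
print(pre_process(example))
```

Each inner list is one sentence, which is the shape needed later to build one vector per sentence.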

A [word2vec](https://radimrehurek.com/gensim/models/word2vec.html) embedding and a [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) model have been trained on this pre-processed corpus. Briefly, these models provide a fixed-length vector representing each word of the corpus (word2vec) and a weight for each word based on its frequency across the whole corpus and within each document (TF-IDF). The result is a parquet table, [stored on GitHub](https://github.com/MrMimic/covid-19-kaggle/blob/master/resources/global_df_w2v_tfidf.parquet), containing a float vector and a TF-IDF score for each word.
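To make the TF-IDF weighting concrete, here is a small standard-library sketch. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which scikit-learn documents as its default, and skips the final vector normalisation. The function names and the toy corpus are assumptions for illustration; how the stored per-word score is actually derived is not shown here.

```python
import math
from collections import Counter


def inverse_document_frequencies(documents):
    """Smoothed IDF per word: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(documents)
    # Count in how many documents each word appears at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log((1 + n_docs) / (1 + freq)) + 1
            for word, freq in doc_freq.items()}


def tf_idf(document, idf):
    """Raw term count times IDF for each word of one tokenised document."""
    counts = Counter(document)
    return {word: count * idf[word]
            for word, count in counts.items() if word in idf}


corpus = [
    ["chloroquine", "treatment", "virus"],
    ["vaccine", "virus"],
    ["virus", "infection", "virus"],
]
idf = inverse_document_frequencies(corpus)
print(idf["virus"], idf["chloroquine"])
print(tf_idf(corpus[2], idf))
```

A word present in every document ("virus") gets the lowest IDF, while a rarer word ("chloroquine") is weighted up.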

![Table header](https://raw.githubusercontent.com/MrMimic/covid-19-kaggle/master/images/header_w2v_tfidf.jpg "Table header")

The file can be regenerated in roughly 30 minutes on an 8-vCPU machine by using [this script](https://github.com/MrMimic/covid-19-kaggle/blob/master/src/main/scripts/train_w2v.py).

Each sentence from the corpus has been pre-processed and [vectorized](https://github.com/MrMimic/covid-19-kaggle/blob/master/src/main/python/c19/embedding.py#L65). To do so, each pre-processed word of a sentence is represented by its vector, weighted by its TF-IDF score. Then, the vectors of all the words composing the sentence are averaged ([Mean of Word Embeddings](https://books.google.fr/books?id=tBxrDwAAQBAJ&pg=PA95&lpg=PA95&dq=mean+of+word+embedding+MOWE&source=bl&ots=7laX_HWKS0&sig=ACfU3U2DvGwGI6Bs4HTkX0_oP7Nf3UTP2A&hl=en&sa=X&ved=2ahUKEwiXguOJ9tjoAhX3D2MBHS6mAzoQ6AEwCnoECA0QKA#v=onepage&q=mean%20of%20word%20embedding%20MOWE&f=false)). All these pre-processed sentences are [stored in the database](https://github.com/MrMimic/covid-19-kaggle/blob/master/src/main/python/c19/text_preprocessing.py#L151).
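The sentence vectorisation described above can be sketched with NumPy. This is an illustration under assumptions: `sentence_vector`, the toy vectors and the choice to skip unknown words are hypothetical, and the library may normalise differently (for example by dividing by the sum of weights rather than by the number of words).

```python
import numpy as np


def sentence_vector(words, word_vectors, tfidf_scores):
    """Average the TF-IDF-weighted vectors of the known words of a sentence."""
    weighted = [
        word_vectors[word] * tfidf_scores.get(word, 1.0)
        for word in words
        if word in word_vectors  # Words without a vector are skipped.
    ]
    if not weighted:
        return None  # No known word: the sentence cannot be embedded.
    return np.mean(weighted, axis=0)


# Toy 3-dimensional vectors and scores, for illustration only.
word_vectors = {
    "chloroquine": np.array([1.0, 0.0, 2.0]),
    "treatment": np.array([0.0, 1.0, 1.0]),
}
tfidf_scores = {"chloroquine": 2.0, "treatment": 1.0}

vector = sentence_vector(["chloroquine", "treatment", "unknown"],
                         word_vectors, tfidf_scores)
print(vector)
```

Here the two known words contribute `[2, 0, 4]` and `[0, 1, 1]`, so the sentence vector is their mean, `[1.0, 0.5, 2.5]`.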

#### Query matching

The query is first [vectorised](https://github.com/MrMimic/covid-19-kaggle/blob/master/src/main/python/c19/query_matching.py#L57) using the same strategy and tools as explained above. The cosine similarity between this query vector and [all stored sentence vectors](https://github.com/MrMimic/covid-19-kaggle/blob/master/src/main/python/c19/query_matching.py#L127) is then computed. Briefly, it measures how close each sentence of the dataset is to the query. Only the top-k sentences are returned (filtered either by a minimal distance threshold or by a fixed number of sentences).
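A minimal NumPy sketch of this matching step is shown below. The function names and the `"number"` method label are assumptions for illustration (only `"distance"` appears in this notebook), and the threshold is applied to the similarity score, so higher means closer.

```python
import numpy as np


def cosine_similarities(query_vector, sentence_matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    query_norm = np.linalg.norm(query_vector)
    row_norms = np.linalg.norm(sentence_matrix, axis=1)
    # Guard against zero vectors to avoid dividing by zero.
    denominators = np.maximum(query_norm * row_norms, 1e-12)
    return sentence_matrix @ query_vector / denominators


def k_closest(similarities, method="distance", threshold=0.8, k=10):
    """Return the indices of the kept sentences, most similar first."""
    order = np.argsort(similarities)[::-1]
    if method == "distance":
        # Keep every sentence whose similarity reaches the threshold.
        return order[similarities[order] >= threshold]
    # Otherwise keep a fixed number of sentences.
    return order[:k]


# Toy 2-dimensional sentence vectors, for illustration only.
sentences = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
scores = cosine_similarities(query, sentences)

print(k_closest(scores, method="distance", threshold=0.7))
print(k_closest(scores, method="number", k=3))
```

With these toy vectors the similarities are roughly 1.0, 0.71, 0.0 and -1.0, so the threshold of 0.7 keeps the first two sentences and the fixed-number variant keeps the first three.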

All these top-k closest sentences are then clustered with a K-means algorithm. These clusters represent the different opinions found about the query.

Only the sentence closest to each centroid is returned (*i.e.*, the sentence that best reflects the opinion on this subject).
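The clustering step and the choice of one representative sentence per cluster can be sketched as below. This is a plain NumPy version of Lloyd's K-means written for illustration; the function names are hypothetical, and the library's own clustering may differ in initialisation, number of restarts and stopping rule.

```python
import numpy as np


def nearest_centroid(vectors, centroids):
    """Index of the closest centroid (Euclidean distance) for each vector."""
    distances = np.linalg.norm(
        vectors[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def k_means(vectors, n_clusters, n_iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct points picked at random.
    start = rng.choice(len(vectors), size=n_clusters, replace=False)
    centroids = vectors[start]
    for _ in range(n_iterations):
        labels = nearest_centroid(vectors, centroids)
        # Move each centroid to the mean of its members (keep it if empty).
        updated = np.array([
            vectors[labels == c].mean(axis=0) if np.any(labels == c)
            else centroids[c]
            for c in range(n_clusters)
        ])
        converged = np.allclose(updated, centroids)
        centroids = updated
        if converged:
            break
    return nearest_centroid(vectors, centroids), centroids


def closest_to_centroids(vectors, labels, centroids):
    """For each cluster, index of the member vector nearest to its centroid."""
    representatives = {}
    for c, centroid in enumerate(centroids):
        members = np.flatnonzero(labels == c)
        if members.size:
            distances = np.linalg.norm(vectors[members] - centroid, axis=1)
            representatives[c] = int(members[distances.argmin()])
    return representatives


# Two well-separated toy groups standing in for two "opinions".
vectors = np.array([[0.0, 0.0], [0.2, 0.0], [0.1, 0.1],
                    [5.0, 5.0], [5.2, 5.0], [5.1, 5.1]])
labels, centroids = k_means(vectors, n_clusters=2)
representatives = closest_to_centroids(vectors, labels, centroids)
print(representatives)
```

The cluster numbers themselves are arbitrary and depend on the random start, which is why each cluster has to be read and interpreted through its representative sentence.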

### What's cool

- The trained embedding is not generic. Even if pre-trained models found on the Internet work well, the context of covid-19 and the kind of sentences to be processed make a locally trained embedding better.
- Code is highly optimised for RAM and rapid processing. Only the resulting DB weighs several gigabytes.
- Code is documented, PEP8 compliant and installable as a Python library.
- The solution is highly portable (even on mobile, with fewer sentences for example) thanks to the use of SQLite.

### What's not

- The database containing all sentences weighs more than 20 GB. It is thus unusable on Kaggle. To overcome this issue, we had to randomly select 10 sentences from the body of each article.

### What's next

Version 2.0 of this work will be released before April 15th. To come:

- Ranking the best papers from opinion clusters based on the authors and their background.
- Auto-estimate K for the number of opinions.
- Auto-test the code on GitHub with unit tests on the methods.
- Maybe some interactive figures.
- Etc ;)

**And during round #2, we would like to develop:**

- A multi-lingual search (maybe with embeddings trained on different languages instead of just translating the query).
- Use a larger pre-trained embedding (on the same corpus but maybe with some data augmentation from PubMed on the given subjects).
- Auto-update with newly published scientific literature through a link to the PubMed API.

### Usage

Queries from the different tasks have been reformulated and [stored on GitHub](https://github.com/MrMimic/covid-19-kaggle/blob/master/resources/queries.json). All of them have been sent to the pipeline and the results are stored in markdown format here (TO COME).

For this notebook, we will focus on a given task and try to answer by using our tool.

### Setup

The library can be easily [installed from GitHub](https://github.com/MrMimic/covid-19-kaggle/blob/master/setup.py).

In [1]:
# Install custom library from Github
!pip install -q --no-warn-conflicts git+https://github.com/MrMimic/covid-19-kaggle

import os
# Custom lib installed from github
from c19 import database_utilities, text_preprocessing, embedding, query_matching, parameters

# Silence noisy dependency warnings
import warnings
warnings.filterwarnings("ignore")

Then, the parameters are loaded ([full explanation of the parameters](https://github.com/MrMimic/covid-19-kaggle/blob/master/src/main/python/c19/parameters.py)).

The Parameters() class returns default parameters, which can be customised.

In [2]:
params = parameters.Parameters(
    first_launch=True,
    database=parameters.Database(
        local_path="local_database.sqlite",
        kaggle_data_path=os.path.join(os.sep, "kaggle", "input", "CORD-19-research-challenge")
    ),
    preprocessing=parameters.PreProcessing(
        max_body_sentences=10,
        stem_words=False
    ),
    query=parameters.Query(
        top_k_sentences_distance=0.8,
        filtering_method="distance"
    )
)

We construct the database by loading all titles and abstracts (as well as randomly chosen sentences from the body, to ensure that the SQLite database can be hosted on Kaggle).

In [3]:
database_utilities.create_db_and_load_articles(
    db_path=params.database.local_path,
    kaggle_data_path=params.database.kaggle_data_path,
    first_launch=params.first_launch,
    load_body=params.preprocessing.load_text_body)

41361 articles to be prepared.


PRE-PROCESSING: 100%|██████████| 41361/41361 [01:05<00:00, 627.15it/s]


Took 1.1 min to prepare 41361 articles for insertion.
Took 0.23 min to insert 41361 articles (SQLite DB: local_database.sqlite).


The pre-trained embeddings are loaded from GitHub. The model can now return word vectors (which can be weighted by TF-IDF scores).

In [4]:
embedding_model = embedding.Embedding(
    parquet_embedding_path=params.embedding.local_path,
    embeddings_dimension=params.embedding.dimension,
    sentence_embedding_method=params.embedding.word_aggregation_method,
    weight_vectors=params.embedding.weight_with_tfidf)

Took 0.58 min to load 48539 Word2Vec vectors (embedding dim: 100).


The sentences are pre-processed, vectorised and inserted into the SQLite database.

In [5]:
text_preprocessing.pre_process_and_vectorize_texts(
    embedding_model=embedding_model,
    db_path=params.database.local_path,
    first_launch=params.first_launch,
    stem_words=params.preprocessing.stem_words,
    remove_num=params.preprocessing.remove_numeric,
    batch_size=params.preprocessing.batch_size,
    max_body_sentences=params.preprocessing.max_body_sentences)

41361 files to pre-process (42 batches of 1000 articles).


PRE-PROCESSING: 100%|██████████| 42/42 [33:54<00:00, 48.43s/it]


Took 34.38 min to pre-process 42 batches of articles.
Took 0.38 min to insert 612928 sentences (SQLite DB: local_database.sqlite).


The database is ready to be used.
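Since the result is a plain SQLite file, it can be inspected with Python's standard `sqlite3` module. The sketch below lists every table with its row count. The helper name and the demo table are hypothetical; the actual table names are whatever the library created and are discovered at run time rather than assumed.

```python
import sqlite3


def describe_database(connection):
    """Return {table_name: row_count} for every table in the database."""
    cursor = connection.cursor()
    cursor.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    tables = [row[0] for row in cursor.fetchall()]
    counts = {}
    for table in tables:
        # Table names come from sqlite_master itself, so quoting is enough.
        cursor.execute(f'SELECT COUNT(*) FROM "{table}"')
        counts[table] = cursor.fetchone()[0]
    return counts


# Demo on a throwaway in-memory database with a hypothetical table.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE demo_sentences (id INTEGER PRIMARY KEY, sentence TEXT)")
demo.executemany(
    "INSERT INTO demo_sentences (sentence) VALUES (?)", [("a",), ("b",)])
print(describe_database(demo))
```

On the notebook's database, the same helper would be called as `describe_database(sqlite3.connect(params.database.local_path))`.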

### Analysis: covid-19 risk factors study

In [6]:
full_sentences_db = query_matching.get_sentences_data(db_path=params.database.local_path)

Queries will be match versus 612916 sentences (1.7 minutes to load).


In [7]:
query = "What do we know about Chloroquine to treat covid-19 induced by coronavirus?"

In [8]:
closest_sentences_df = query_matching.get_k_closest_sentences(
    query=query,
    all_sentences=full_sentences_db,
    embedding_model=embedding_model,
    number_threshold=params.query.top_k_sentences_number,
    distance_threshold=params.query.top_k_sentences_distance,
    filtering_method=params.query.filtering_method)

Took 0.21 minutes to process the query (447 sentences kept by distance filtering).


In [9]:
closest_sentences_df = query_matching.clusterise_sentences(
    k_closest_sentences_df=closest_sentences_df,
    number_of_clusters=3)

Took 0.23 seconds to clusterise 447 closest sentences.


In [10]:
closest_sentences_df.sort_values(by="is_closest", ascending=False).head(10)

| | paper_doi | section | raw_sentence | sentence | vector | distance | cluster | is_closest |
|---|---|---|---|---|---|---|---|---|
| 186 | 10.1016/j.ijid.2020.03.004 | body | At present, there is no vaccine or antiviral t... | ["present", "vaccine", "antiviral", "treatment... | [-0.37089855500119473, -0.7908407018468373, 0.... | 0.82243 | 2 | True |
| 6 | 10.3201/eid2106.150176 | abstract | The antimalarial drug chloroquine has been sug... | ["antimalarial", "drug", "chloroquine", "sugge... | [-0.9581731515000544, -1.888246166391475, 0.83... | 0.879102 | 1 | True |
| 7 | 10.3201/eid2106.150176 | body | The antimalarial drug chloroquine has been sug... | ["antimalarial", "drug", "chloroquine", "sugge... | [-0.9581731515000544, -1.888246166391475, 0.83... | 0.879102 | 1 | True |
| 9 | 10.1002/cbf.3182 | abstract | Although the mechanisms of action of chloroqui... | ["although", "mechanisms", "action", "chloroqu... | [-0.31700938330832396, -1.3640088643008028, 0.... | 0.876977 | 0 | True |
| 0 | 10.1038/cddis.2013.225 | abstract | Chloroquine has also been used as anti-inflamm... | ["chloroquine", "also", "used", "anti-inflamma... | [-0.5233842921424556, -0.8737552569358066, 0.4... | 0.908413 | 0 | False |
| 301 | 10.1016/s0140-6736(10)60357-1 | abstract | These data show the potential of RNA interfere... | ["data", "show", "potential", "rna", "interfer... | [-0.07484730531915117, -1.2519306647688682, -0... | 0.810981 | 2 | False |
| 297 | 10.1128/cmr.00045-07 | abstract | Summary: Though several antivirals have been d... | ["summary", "though", "several", "antivirals",... | [-0.5380644737652062, -0.9421896836808804, 0.3... | 0.811253 | 2 | False |
| 298 | 10.1099/jgv.0.000309 | abstract | These results advocate that chloroquine should... | ["results", "advocate", "chloroquine", "consid... | [-0.77465973534322, -1.8164638308488823, 0.260... | 0.811225 | 0 | False |
| 299 | 10.3389/fmicb.2019.03079 | abstract | Currently, there are no vaccines or therapeuti... | ["currently", "vaccines", "therapeutic", "drug... | [-0.2752155307130031, -0.6325520597285712, 0.3... | 0.811207 | 2 | False |
| 300 | 10.1155/2013/504563 | body | Although immunoglobulin and antiviral agent ri... | ["although", "immunoglobulin", "antiviral", "a... | [-0.4853182000607353, -1.3503179242955925, 0.5... | 0.811082 | 0 | False |


In [11]:
# There are 3 clusters:
# 1 = maybe Chloroquine has an effect
# 0 = effect
# 2 = no effect
closest_sentences_df["cluster"].value_counts()

2    236
0    161
1     50
Name: cluster, dtype: int64

In [12]:
for _, row in closest_sentences_df[closest_sentences_df["is_closest"]].iterrows():
    print(f"Cluster : {row.cluster}")
    print(f"{row.raw_sentence}")
    print()

Cluster : 1
The antimalarial drug chloroquine has been suggested as a treatment for Ebola virus infection.

Cluster : 1
The antimalarial drug chloroquine has been suggested as a treatment for Ebola virus infection.

Cluster : 0
Although the mechanisms of action of chloroquine clearly indicate that it might inhibit filoviral infections, several clinical trials that attempted to use chloroquine in the treatment of other acute viral infections – including dengue and influenza A and B – caused by low pH‐dependent viruses, have reported that chloroquine had no clinical efficacy, and these results demoted chloroquine from the potential treatments for other virus families requiring low pH for infectivity.

Cluster : 2
At present, there is no vaccine or antiviral treatment for human and animal coronavirus, so that identifying the drug treatment options as soon as possible is critical for the response to the COVID-19 outbreak.

