# COVID-19 Open Research Dataset - Text and Data Mining Tools

- Text and data mining tools that can help the medical community develop answers to high priority scientific questions
- Text mining tools to provide insights on these questions
- Find answers to questions within
- Connect insights across

## Intro

Starting from each question, I wanted to find semantically similar words, phrases, and context, then retrieve a wider range of articles so that researchers can explore the search results. A Q&A model is then used to find the relevant answer or section within those articles.

I focused on retrieving a wider range of semantically relevant articles rather than on accuracy. When a question is challenging to answer, it may be useful to see the range of results.


## Semantic Similarity Search using RoBERTa

We will be using [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) to create sentence embeddings for the semantic similarity search task. The library enables us to use fine-tuned BERT / RoBERTa / DistilBERT / ALBERT / XLNet models for semantic textual similarity tasks.

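We then rank the articles by cosine similarity (1 minus cosine distance) between the question embedding and each article's sentence embeddings, keeping the best-matching sentence per article.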
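
As a rough illustration of the ranking step, the sketch below computes cosine similarity with NumPy on toy 2-d vectors. The function name `rank_by_cosine_similarity` and the toy vectors are assumptions made for this example; real sentence embeddings have far more dimensions.

```python
import numpy as np

def rank_by_cosine_similarity(query_vector, sentence_vectors):
    """Return (indices sorted best-first, similarities) for the rows of sentence_vectors."""
    query = np.asarray(query_vector, dtype=float)
    matrix = np.asarray(sentence_vectors, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    similarities = matrix @ query / norms
    return np.argsort(-similarities), similarities

# Toy 2-d vectors standing in for real sentence embeddings.
toy_query = [1.0, 0.0]
toy_sentences = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
order, scores = rank_by_cosine_similarity(toy_query, toy_sentences)
print(order, scores)
```

Vector length does not matter here: `[2.0, 0.0]` points in the same direction as the query, so it ranks first.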



## Bag-of-words retrieval 

We will also be using [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) to search for relevant articles based on the frequency of the words used in the question.
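
To make the idea concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not gensim's implementation: the smoothed IDF variant, the default `k1` and `b` values, and the toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenised document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalised through b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Made-up, pre-tokenised documents.
toy_corpus = [
    "incubation period of the virus".split(),
    "virus persistence on copper surfaces".split(),
    "seasonal patterns of influenza".split(),
]
toy_scores = bm25_scores("incubation period".split(), toy_corpus)
print(toy_scores)
```

Only the first toy document contains the query words, so it is the only one with a non-zero score.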

My finding was that Semantic Similarity Search gives higher-recall results, but the articles ranked at the top might not be the most relevant to the topic of the question. Combining the scores from Semantic Similarity Search and BM25 might therefore give diverse results that stay within the topic.

## Question Answering with BERT


After ranking articles with these two methods, we combine the scores by averaging them to find the most relevant articles.
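
The averaging step could look roughly like the sketch below. The helper names `min_max` and `combine_scores` and the toy numbers are assumptions for this example; the BM25 scores are rescaled to [0, 1] first so that they are comparable with cosine similarities.

```python
import numpy as np

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

def combine_scores(semantic_scores, bm25_raw_scores):
    """Average cosine similarities with min-max scaled BM25 scores."""
    return (np.asarray(semantic_scores, dtype=float) + min_max(bm25_raw_scores)) / 2

# Toy scores for three documents.
combined = combine_scores([0.9, 0.5, 0.7], [0.0, 12.0, 6.0])
ranking = np.argsort(-combined)
print(combined, ranking)
```

In this toy case the document with the best semantic score drops to last place because it has no keyword overlap at all, which is the balancing effect the averaging is meant to provide.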

We then use [Transformers](https://huggingface.co/transformers/) `BertForQuestionAnswering` to find answers or relevant sections in each article.
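
An extractive Q&A model produces one start score and one end score per token, and the answer is the span between the chosen start and end tokens. The sketch below shows one way to pick that span from toy scores. It is a variation rather than the notebook's exact logic: it additionally requires the end not to precede the start, and `best_span`, `max_answer_len`, and the toy values are assumptions for illustration.

```python
import numpy as np

def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) of the best span with start <= end."""
    best = (0, 0, -np.inf)
    for start, start_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best[2]:
                best = (start, end, score)
    return best

# Made-up tokens and scores standing in for model output.
toy_tokens = ["the", "incubation", "period", "is", "five", "days", "."]
toy_start = np.array([0.1, 0.2, 0.1, 0.3, 4.0, 1.0, 0.0])
toy_end = np.array([0.0, 0.1, 0.5, 0.2, 1.0, 5.0, 0.3])
start, end, score = best_span(toy_start, toy_end)
answer_tokens = toy_tokens[start:end + 1]
print(answer_tokens, score)
```

The summed start and end scores double as a rough confidence value, which is how answers from different paragraphs can be compared.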


## Conclusions


Given the limited RAM and disk space available in the Kaggle Kernel environment, I only managed to process the dataset in the `custom_license` folder.

This approach was my initial exploration of and research into these NLP methods, and I'm pretty sure improvements can easily be made that would save a lot of time and space complexity 😓.

The BERT Q&A model sometimes didn't find confident enough answers, especially when the query sentence isn't in question format and is only a few words long. As a simple workaround, I prefixed those questions with the phrase "What do we know about".
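
In the notebook the prefix was added by hand. A small helper could automate it along these lines; the function `as_question`, the `QUESTION_WORDS` list, and the heuristic itself are hypothetical and only meant as a sketch.

```python
# Hypothetical list of words that usually open a question.
QUESTION_WORDS = ("what", "how", "why", "when", "where", "which", "who",
                  "is", "are", "does", "do", "can")

def as_question(query, prefix="What do we know about"):
    """Prefix queries that do not already look like a question."""
    stripped = query.strip()
    first_word = stripped.split(" ", 1)[0].lower() if stripped else ""
    if stripped.endswith("?") or first_word in QUESTION_WORDS:
        return stripped
    return f"{prefix} {stripped}"

print(as_question("Seasonality of transmission"))
print(as_question("What is known about transmission?"))
```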


## Lastly 

There were a few other variations of these approaches I wanted to try, and different parameters I wanted to test.

The results seem quite convincing; however, virology isn't my area of expertise, so it would be interesting to find out whether the results are of any use to the researchers who are on the front line of the field.

Also, thanks to the challenge and its participants. It was inspirational to see and learn different approaches and techniques.

### Install packages

In [None]:
!pip install -q sentence-transformers
!pip install -q tqdm
!pip install -q sqlitedict
!pip install -q transformers
!pip install -q gensim==3.8.2

### Imports

In [None]:
import numpy as np
import pandas as pd
import glob
import os
import json
import shelve
import nltk
import pickle
from tqdm.notebook import tqdm
from google.cloud import storage
from sentence_transformers import SentenceTransformer
from io import BytesIO
from gensim.summarization.bm25 import BM25
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import minmax_scale
from sqlitedict import SqliteDict
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
import logging
from scipy.spatial.distance import cosine
from IPython.display import display, HTML

logging.getLogger("transformers").setLevel(logging.ERROR)
nltk.download('punkt')
nltk.download('stopwords')

### GCS Client

In [None]:
class GcsBucket:

    def __init__(self, bucket):
        self.bucket = bucket
        
        
    def upload_file(self, path, key):
        blob = self.bucket.blob(key)
        return blob.upload_from_filename(path)


    def get_file(self, key):
        blob = storage.blob.Blob(key, self.bucket)
        content = blob.download_as_string()
        return BytesIO(content)
    
    
    def download_file(self, key, path):
        blob = storage.blob.Blob(key, self.bucket)
        return blob.download_to_filename(path)

    
    def file_exists(self, key):    
        blob = self.bucket.blob(key)
        return blob.exists()
    
    
    def remove_file(self, key):
        blob = self.bucket.blob(key)
        return blob.delete()


gcs_client = storage.Client(project='nomadic-zoo-255800')
bucket = GcsBucket(
    bucket=gcs_client.get_bucket("yukku")
)

### Constants

In [None]:
QUESTION_TOPICS = "COVID-19, coronavirus, SARS-CoV-2, novel coronavirus"
MODEL_NAME = "roberta-large-nli-stsb-mean-tokens"

DATASET_DIR = "/kaggle/input/CORD-19-research-challenge"
DATA_DIR = "/kaggle/working/data"

GCS_DATA_DIR = "covid_19"

DATASET_PATH = os.path.join(DATA_DIR, "CORD-19-research-challenge-2.csv")
GCS_DATASET_PATH = os.path.join(GCS_DATA_DIR, "CORD-19-research-challenge-2.csv")

DATABASE_PATH = os.path.join(DATA_DIR, "database-2.db")
GCS_DATABASE_PATH = os.path.join(GCS_DATA_DIR, "database-2.db")

BM25_PATH = os.path.join(DATA_DIR, "bm25-2.pickle")
GCS_BM25_PATH = os.path.join(GCS_DATA_DIR, "bm25-2.pickle")


!mkdir -p "{DATA_DIR}"

### Load dataset

In [None]:
def create_dataset(json_filenames):
    
    # Collect rows in a list and build the DataFrame once;
    # appending to a DataFrame row by row copies it every time.
    rows = []
    
    for file_name in tqdm(json_filenames):
        
        with open(file_name) as json_data:
            data = json.load(json_data)
            row = {
                "doc_id": data['paper_id'], 
                "title": data['metadata']['title'],
                "abstract": None, 
                "text_body": None
            }
            
            # Not every document has an abstract.
            if 'abstract' in data:
                row['abstract'] = "\n ".join(
                    [abst['text'] for abst in data['abstract']]
                )
  
            row['text_body'] = "\n ".join(
                [bt['text'] for bt in data['body_text']]
            )
                                    
            rows.append(row)
    
    return pd.DataFrame(rows, columns=[
        "doc_id", 
        "title", 
        "abstract", 
        "text_body"
    ])


def remove_file(path):
    
    if os.path.exists(path):
        os.remove(path)
        
        
def save_dataset(dataframe, path):
    
    os.makedirs(os.path.dirname(path), exist_ok=True)
    dataframe.to_csv(path, header=None, index=None)

    
def load_dataset(reload=False):
    if bucket.file_exists(GCS_DATASET_PATH) and not reload:
        dataframe = pd.read_csv(
            bucket.get_file(GCS_DATASET_PATH), 
            names=[
                "doc_id", 
                "title", 
                "abstract", 
                "text_body"
            ]
        )
    else:
        dataframe = create_dataset(
            glob.glob(
                f"{DATASET_DIR}/**/custom_license/pdf_json/*.json", 
                recursive=True
            )
        )
        save_dataset(dataframe, DATASET_PATH)
        response = bucket.upload_file(
            DATASET_PATH, 
            GCS_DATASET_PATH
        )
        remove_file(DATASET_PATH)    
        
    return dataframe


dataset = load_dataset(reload=False)

In [None]:
print(f"total documents: {len(dataset)}")
dataset.head()
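
To show which JSON fields the loader relies on, here is a small sketch that flattens one record held in memory. The helper `flatten_record` and the toy record are made up for illustration; only the keys (`paper_id`, `metadata.title`, `abstract`, `body_text`) come from the loading code.

```python
def flatten_record(data):
    """Turn one parsed CORD-19 JSON document into a flat row."""
    return {
        "doc_id": data["paper_id"],
        "title": data["metadata"]["title"],
        # "abstract" may be absent; join its paragraphs when present.
        "abstract": "\n ".join(p["text"] for p in data["abstract"]) if "abstract" in data else None,
        "text_body": "\n ".join(p["text"] for p in data["body_text"]),
    }

# Made-up record, shaped like the fields the loader reads.
toy_record = {
    "paper_id": "abc123",
    "metadata": {"title": "A toy title"},
    "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}],
}
row = flatten_record(toy_record)
print(row)
```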

### BM25

In [None]:
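def create_corpus(dataset):
    # Join the fields with spaces so that the last word of one field
    # is not glued to the first word of the next before tokenising.
    df = dataset["title"].fillna('') \
        + " " + dataset["abstract"].fillna('') \
        + " " + dataset["text_body"].fillna('')
    # Series.items() replaces the deprecated iteritems().
    for index, text in tqdm(df.items(), total=len(df)):
        yield word_tokenize(text)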


def load_bm25(save_path, gcs_path, reload=False):
    if bucket.file_exists(gcs_path) and not reload:
        bucket.download_file(gcs_path, save_path)
        with open(save_path, 'rb') as file:
            return pickle.load(file)
    else:
        
        bm25 = BM25(create_corpus(dataset))
        with open(save_path, 'wb') as file:
            pickle.dump(bm25, file)

        bucket.upload_file(save_path, gcs_path)
        return bm25


bm25 = load_bm25(
    save_path=BM25_PATH, 
    gcs_path=GCS_BM25_PATH,
    reload=False
)

### Instantiate Sentence Transformer

In [None]:
model = SentenceTransformer(MODEL_NAME)
tensors = model.to('cuda' if torch.cuda.is_available() else 'cpu')

### Load embedding database

In [None]:
def get_sentences(texts):
    results = []
    if not pd.isnull(texts): 
        results = nltk.tokenize.sent_tokenize(texts)
    return results
    

def iter_sentences(dataset):
    for _, row in dataset.iterrows():
        # Access by column name rather than by position.
        doc_id = row["doc_id"]
        title = row["title"]
        abstract = row["abstract"]
        
        for index, sentence in enumerate(get_sentences(title)):
            yield (f"{doc_id}-title-{index}", sentence)

        for index, sentence in enumerate(get_sentences(abstract)):
            yield (f"{doc_id}-abstract-{index}", sentence)
            

def iter_embeddings(model, dataset):
    batch_length = 50000
    
    doc_ids = []
    sentences = []
    count = 0
    for doc_id, sentence in iter_sentences(dataset): 
        doc_ids.append(doc_id)
        sentences.append(sentence)
            
        if len(doc_ids) >= batch_length:
            print(f"processing {count * batch_length} to {(count + 1) * batch_length}")
            for item in zip(doc_ids, model.encode(sentences)):
                yield item
            doc_ids = []
            sentences = []
            count += 1

    # Flush the final partial batch. The number of sentences is not known
    # up front, so it cannot be detected by comparing with len(dataset).
    if doc_ids:
        print(f"processing {count * batch_length} to {count * batch_length + len(doc_ids)}")
        for item in zip(doc_ids, model.encode(sentences)):
            yield item
           
                
def create_and_save_embeddings(model, dataset, database):
    for doc_id, embedding in iter_embeddings(model, dataset):
        database[doc_id] = embedding

    database.commit()
 

def load_database(model, dataset, save_path, gcs_path, reload=False):
    if bucket.file_exists(gcs_path) and not reload:
        bucket.download_file(gcs_path, save_path)
    else:
        database = SqliteDict(save_path)
        create_and_save_embeddings(model, dataset, database)
        database.close()
        bucket.upload_file(save_path, gcs_path)

    return SqliteDict(save_path)
        

database = load_database(
    model=model, 
    dataset=dataset, 
    save_path=DATABASE_PATH, 
    gcs_path=GCS_DATABASE_PATH
)
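
Batching a stream whose total length is not known in advance can be sketched with a small generator. The name `iter_batches` and the toy keys are assumptions for this example; the point is that the last, partial batch is emitted after the loop ends.

```python
def iter_batches(items, batch_size):
    """Yield lists of up to batch_size items, including the final partial batch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    # The tail is flushed after the loop, so no total count is needed.
    if batch:
        yield batch

# Seven toy keys from a generator, batched in threes.
toy_keys = (f"doc-title-{i}" for i in range(7))
batches = list(iter_batches(toy_keys, batch_size=3))
print([len(batch) for batch in batches])
```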

### Scoring using cosine distance between embeddings and BM25

In [None]:
def build_get_relevant_documents(model, dataset, database):
    
    def get_relevant_documents(text, max_search):
        target_embedding = model.encode([text])[0]

        semantic_similarity_scores = {}
        for key, embedding in tqdm(database.iteritems(), total=len(database)):
            similarity = 1 - cosine(target_embedding, embedding)
            # Split from the right so a doc_id containing "-" stays intact.
            [doc_id, doc_type, doc_index] = key.rsplit("-", 2)
            
            has_score = doc_id in semantic_similarity_scores
            doc_higher_score = has_score and semantic_similarity_scores[doc_id] < similarity
            if not has_score or doc_higher_score:
                semantic_similarity_scores[doc_id] = similarity
       
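        # Documents without any title or abstract sentence have no embedding;
        # give them a similarity of 0 instead of NaN.
        dataset["score"] = dataset["doc_id"] \
            .map(semantic_similarity_scores) \
            .fillna(0)

        # Scale BM25 scores to [0, 1] before averaging with the similarities.
        bm25_scores = minmax_scale(bm25.get_scores(word_tokenize(text)), feature_range=(0,1))
        dataset["score"] = (dataset["score"] + bm25_scores)/2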

        return dataset.sort_values("score", ascending=False).head(n=max_search)
        
    return get_relevant_documents


get_relevant_documents = build_get_relevant_documents(
    model=model,
    dataset=dataset,
    database=database
)
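
Each embedding key has the form `<doc_id>-<title|abstract>-<n>`, and a document is scored by its best-matching sentence. A minimal sketch of that reduction is below; the helper name `best_score_per_document` and the toy scores are assumptions for illustration.

```python
def best_score_per_document(sentence_scores):
    """Collapse {"<doc_id>-<field>-<n>": score} into {doc_id: max score}."""
    best = {}
    for key, score in sentence_scores.items():
        # Split from the right so a doc_id containing "-" stays intact.
        doc_id, _field, _position = key.rsplit("-", 2)
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return best

# Made-up sentence-level similarities.
toy_scores = {
    "abc123-title-0": 0.41,
    "abc123-abstract-0": 0.77,
    "abc123-abstract-1": 0.52,
    "def456-title-0": 0.63,
}
doc_scores = best_score_per_document(toy_scores)
print(doc_scores)
```

Taking the maximum rather than the mean favours recall: one strongly matching sentence is enough to surface an article.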

### Instantiate Q&A BERT model

In [None]:
qa_model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
qa_tokeniser = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tensors = qa_model.to('cuda' if torch.cuda.is_available() else 'cpu')
tensors_eval = qa_model.eval()

In [None]:
def create_features(question, document):
    input_ids = qa_tokeniser.encode(question, document)
    tokens = qa_tokeniser.convert_ids_to_tokens(input_ids)  
    sep_index = input_ids.index(qa_tokeniser.sep_token_id)

    seg_start_count = sep_index + 1
    seg_end_count = len(input_ids) - seg_start_count
    segment_ids = [0]*seg_start_count + [1]*seg_end_count

    return input_ids, tokens, segment_ids


def get_paragraphs(question, document):

    sentences = get_sentences(document)
    paragraphs = []
    
    while len(sentences) > 0:
        expected_length = len(
            qa_tokeniser.encode(
                question, " ".join(paragraphs) + " " + sentences[0]
            )
        )
        if expected_length < qa_tokeniser.max_len:
            paragraphs.append(sentences.pop(0)) 
        elif len(paragraphs) == 0:
            # A single sentence that is too long on its own is skipped.
            sentences.pop(0)
        else:
            out = paragraphs
            paragraphs = [] 
            yield " ".join(out)

    # Yield the final chunk; otherwise the tail of every document
    # (and any document that fits in one chunk) would never be returned.
    if paragraphs:
        yield " ".join(paragraphs)


def get_answer(question, document, min_confidence):
    
    torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
    answers = []
    answers_all = []
    confidences = []

    for paragraph in get_paragraphs(question, document):
        
        input_ids, tokens, segment_ids = create_features(question, paragraph)   

        # Inference only: skip gradient tracking to save memory.
        with torch.no_grad():
            outputs = qa_model(
                torch.tensor([input_ids]).to(torch_device), 
                token_type_ids=torch.tensor([segment_ids]).to(torch_device)
            )
        # Index instead of unpacking so that both tuple and
        # ModelOutput return types are handled.
        start_scores, end_scores = outputs[0], outputs[1]

        # Drop the leading [CLS] and the trailing [SEP].
        start_scores = start_scores[:,1:-1]
        end_scores = end_scores[:,1:-1]

        # Indices are relative to the sliced scores,
        # so position i corresponds to input_ids[i + 1].
        answer_start = torch.argmax(start_scores).item()
        answer_end = torch.argmax(end_scores).item()

        # The end token is inclusive, hence + 2 on the slice end.
        answer = qa_tokeniser.decode(
            input_ids[answer_start + 1:answer_end + 2], 
            skip_special_tokens=True, 
            clean_up_tokenization_spaces=True
        )
        answers.append(answer)

        # The paragraph is everything after the first [SEP].
        answer_all = qa_tokeniser.decode(
            input_ids[tokens.index('[SEP]') + 1:][:-1], 
            skip_special_tokens=True, 
            clean_up_tokenization_spaces=True
        )
        answers_all.append(answer_all)

        # Both indices are valid positions in the sliced scores.
        confidences.append(
            start_scores[0,answer_start].item() + end_scores[0,answer_end].item()
        )

    if len(answers) == 0:
        return None
        
    # Keep the answer from the paragraph with the highest confidence.
    confidence = max(confidences)
    final_answer = answers[confidences.index(confidence)]
    answer_all = answers_all[confidences.index(confidence)]

    if len(final_answer) > 0 and confidence > min_confidence:
        return {
            "confidence": confidence, 
            "text": final_answer, 
            "paragraph": answer_all
        }
    else:
        return None



def get_answers(question_topics, question_context, question, max_search, min_confidence):

    relevant_docs = get_relevant_documents(
        f"{question_topics} {question_context} {question}",
        max_search
    )

    for index, row in relevant_docs.iterrows():
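        # NaN is truthy, so test for missing text explicitly before falling back.
        text_body = row["text_body"]
        document = row["abstract"] if pd.isna(text_body) or not text_body else text_body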
        if not pd.isna(document):
            answer = get_answer(question, document, min_confidence)
            if answer:
                yield row["doc_id"], row["title"], answer["text"], answer["paragraph"], answer["confidence"]
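
The chunking idea, packing whole sentences into pieces that fit the model's input limit, can be sketched without the BERT tokenizer. Here whitespace word counts stand in for real token counts, and `chunk_sentences`, `max_tokens`, and the toy sentences are assumptions for illustration.

```python
def chunk_sentences(sentences, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks of at most max_tokens tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = count_tokens(sentence)
        if n_tokens > max_tokens:
            continue  # a single over-long sentence cannot fit anywhere
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    # Emit whatever is left once the sentences run out.
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy "sentences" of 3, 2, 4 and 1 words with a budget of 5.
toy_sentences = ["a b c", "d e", "f g h i", "j"]
chunks = chunk_sentences(toy_sentences, max_tokens=5)
print(chunks)
```

With a real tokenizer the budget would also have to leave room for the question and the special tokens.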
    

### Result display

In [None]:
def create_highlighted(text, text_all):
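    # Split on the first occurrence only, so that text following a
    # repeated occurrence of the answer is not dropped.
    split_text = text_all.split(text, 1)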
    beginning_text = ""
    end_text = ""

    if len(split_text) > 0:
        beginning_text = split_text[0]
    
    if len(split_text) > 1:
        end_text = split_text[1]
    
    return f"<div>" \
        f"<font color='#bbbbbb'>{beginning_text}</font>" \
        f"{text}" \
        f"<font color='#bbbbbb'>{end_text}</font>" \
        f"</div>"

    
def display_answers(
        question_context, 
        question,
        question_topics=QUESTION_TOPICS, 
        max_search=50, 
        min_confidence=0
    ):
    
    data = []
    
    answers = list(get_answers(question_topics, question_context, question, max_search, min_confidence))
    answers = sorted(answers, key=lambda x: x[4], reverse=True)[0:10]
    
    if len(answers) == 0:
        display(
            HTML(
                f"<h4>Confident enough results not found</h4>"
            )
        )
        return None

    for doc_id, title, text, paragraph, confidence in answers:
        data.append([title, confidence, create_highlighted(text, paragraph)])

    dataframe = pd.DataFrame(
        data, 
        columns = ["title", "confidence", "answer"]
    )

    display(
        HTML(
            f"<br/><h2>{question_context}</h2><br/>" \
            f"<h3> - {question}</h3>"
        )
    )

    display(
        HTML(
            dataframe.to_html(
                render_links=True, 
                escape=False
            )
        )
    )

# Results

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="Range of incubation periods for the disease in humans",
    question_topics=f"{QUESTION_TOPICS}, how this varies across age and health status and how long individuals are contagious, even after recovery",
    max_search=50,
    min_confidence=0
)

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="Prevalence of asymptomatic shedding and transmission",
    question_topics=f"{QUESTION_TOPICS}, children",
    max_search=50,
    min_confidence=0
)

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="What do we know about Seasonality of transmission",
    question_topics=f"{QUESTION_TOPICS}",
    max_search=50,
    min_confidence=0
)

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="Physical science of the coronavirus",
    question_topics=f"{QUESTION_TOPICS}, charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding",
    max_search=50,
    min_confidence=0
)

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="Persistence and stability on a multitude of substrates and sources",
    question_topics=f"{QUESTION_TOPICS}, nasal discharge, sputum, urine, fecal matter, blood",
    max_search=50,
    min_confidence=0
)

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="Persistence of virus on surfaces of different materials",
    question_topics=f"{QUESTION_TOPICS}, copper, stainless steel, plastic",
    max_search=50,
    min_confidence=0
)

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="Natural history of the virus and shedding of it from an infected person",
    question_topics=f"{QUESTION_TOPICS}",
    max_search=50,
    min_confidence=0
)

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="Implementation of diagnostics and products to improve clinical processes",
    question_topics=f"{QUESTION_TOPICS}",
    max_search=50,
    min_confidence=0
)

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="What do we know about disease models, including animal models",
    question_topics=f"{QUESTION_TOPICS}, infection, disease and transmission",
    max_search=50,
    min_confidence=0
)

In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="What do we know about Tools and studies to monitor phenotypic change",
    question_topics=f"{QUESTION_TOPICS}, potential adaptation of the virus",
    max_search=50,
    min_confidence=0
)


In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="What do we know about Immune response and immunity",
    question_topics=f"{QUESTION_TOPICS}",
    max_search=50,
    min_confidence=0
)


In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="Effectiveness of movement control strategies to prevent secondary transmission",
    question_topics=f"{QUESTION_TOPICS}, health care and community settings",
    max_search=50,
    min_confidence=0
)


In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings",
    question_topics=f"{QUESTION_TOPICS}",
    max_search=50,
    min_confidence=0
)


In [None]:
display_answers(
    question_context="What is known about transmission, incubation, and environmental stability?",
    question="What do we know about Role of the environment in transmission",
    question_topics=f"{QUESTION_TOPICS}",
    max_search=50,
    min_confidence=0
)