<font size="6" >👑 Question-Answering model for CORD-19</font>

<br>

<font size="5">Learn how to dive into CORD-19 by answering open-domain questions</font>

<br><br>

<font size="3">
    <strong>Why:</strong> at the time of writing, there are more than 580 notebooks on the <a href="https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks" >COVID-19 Open Research Dataset</a> challenge. The main reason why we are here, united, is that we want to help the research community <strong>find answers</strong>. Yes, we need to find answers now, as a deep understanding of the coronavirus infectious disease may save lives!
</font>

<br><br>
 
<font size="3">
  <strong>Simple and intuitive:</strong> the second main reason we are here is to learn and grow together. This notebook has been designed to be easy to understand for beginners while, hopefully, still offering valuable insights to advanced Kagglers. The notebook runs in less than 5 minutes, so feel free to fork it and work on your own version!
</font>

<br><br>

<font size="3">
  <strong>What:</strong> in just a few lines of code we develop, from <strong>start to finish</strong>, a universal question-answering system able to answer (almost) any kind of question related to coronavirus. In particular, the notebook will attempt to answer all the questions from the <a href="https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks">CORD-19 TASKS</a>. Instead of storing the final answers and conclusions in an <em>ugly-and-hard-to-understand</em> pandas DataFrame, we will display them on screen using some HTML and CSS (as in the image below).
</font>
  
    
<br><br>

<font size="3">
<strong>Feedback:</strong> as I believe in this cause, I put my best into this notebook and spent quite a lot of time on the code and on the formatting to make it easy to understand. If something is unclear or wrong, please <strong>leave a comment</strong> and I will improve/fix that part. Disclaimer: work in progress, I'm still adding new code, resources and comments.
</font>


### Output results:

![ai-coronavirus](https://i.imgur.com/Vnqu9J2.png)

### 1. Introduction

#### Question-Answering (QA) model

In machine learning, a question-answering model works with three parts: the `question`, the `context` and the `answer`. The model inputs are the `question` and the `context`, and the model output is the `answer`. In most cases, but not all, the `answer` is contained in the `context`. For simplicity, throughout the notebook, we will assume that this is always true.
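
To make the three parts concrete, here is a minimal sketch of a single (`question`, `context`, `answer`) triple. The text is invented for illustration and is not taken from SQuAD or CORD-19; the only point is that, under the assumption above, the answer is a span of the context and can be located by its character offsets.

```python
# A toy (question, context, answer) triple; the text is invented for illustration.
example = {
    "question": "What causes COVID-19?",
    "context": "COVID-19 is an infectious disease caused by the SARS-CoV-2 virus.",
    "answer": "the SARS-CoV-2 virus",
}

# In extractive QA the answer is a span of the context,
# so it can be described by a start and an end character offset.
start = example["context"].find(example["answer"])
end = start + len(example["answer"])

print(start, end)
print(example["context"][start:end])
```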

Many datasets exist for training QA models. One of the most popular is the Stanford Question Answering Dataset, also known as [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/). It contains thousands of tuples of the type (`question`, `context`, `answer`) used to teach the model what it means to both **find** and **return** an answer. During training, the model learns and exploits linguistic properties of the language.

#### Using a search engine to produce the context

In general, the `context` is quite limited, about one page. In our case, instead, we are dealing with more than 40k papers. **We therefore need to reduce the size of the context.** We do so by selecting the papers that are most similar to the `question`. In the code, a very simple algorithm, [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25), is used. As you will see, even though Okapi BM25 is quite old (its roots go back to the 1980s), it does a great job. In the future, I plan to compare the Okapi solution against more recent approaches such as transformers.

In a nutshell, this is all the code related to the search engine. Simple, right? The `search(query, num)` method returns a data frame containing up to `num` papers most similar to the query.

```python
metadata_df = pd.read_csv(metadata_path)
metadata_df = metadata_df.dropna(subset=['abstract', 'title']) # keep papers with a title and an abstract
cse = CovidSearchEngine(metadata_df) # Covid Search Engine
cse.search("what is coronavirus?", num=10)
```
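
If you are curious about what happens inside the ranking step, the sketch below scores a few tokenized toy documents with a textbook form of BM25. It is only an illustration: the function name `bm25_scores`, the toy documents, the parameter values `k1=1.5` and `b=0.75`, and the particular IDF formula (a common non-negative variant) are my assumptions, and the `rank_bm25` package used in this notebook may differ in details.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # Rare terms get a higher weight than common ones
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            tf = term_freq[term]
            # Term frequency saturates and is normalized by document length
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, already tokenized (invented for illustration)
docs = [
    ["coronavirus", "transmission", "droplets"],
    ["influenza", "vaccine", "trial"],
    ["coronavirus", "incubation", "period", "estimate"],
]
scores = bm25_scores(["incubation", "period"], docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Documents that share no term with the query get a score of 0, which is why the search engine can simply drop them.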

#### From (context, question) to answer with transformers

This is the most interesting and magical part of the whole notebook. Given a question `q` (we use the terms query and question interchangeably), the previous section gives us a list of contexts. Now, for each context and for the same query `q`, we ask a pre-trained and pretty powerful transformer model which part of the context **best answers** the query.

You may ask how we can build such a powerful model in just a few lines of code. The reason is that we use the (great, you need to check it out if you haven't!) [Hugging Face Transformers library](https://github.com/huggingface/transformers), which makes it easy to work with such big and complex neural networks.
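
Conceptually, an extractive QA model gives every token of the input a "start" score and an "end" score, and the answer is the span between the best start and the best end. The sketch below shows that selection step in plain Python, with no model involved; the tokens and scores are made up, and `best_span` is a hypothetical helper name.

```python
def best_span(tokens, start_scores, end_scores):
    """Return the tokens between the highest-scoring start and end positions."""
    start = max(range(len(start_scores)), key=lambda i: start_scores[i])
    end = max(range(len(end_scores)), key=lambda i: end_scores[i])
    # If end < start the slice is empty, i.e. no usable answer was found
    return tokens[start:end + 1]

# Question and context tokens, separated by [SEP] (made-up example)
tokens = ["[CLS]", "what", "is", "it", "[SEP]", "an", "infectious", "disease", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.2, 2.5, 0.3, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.1, 0.4, 3.1, 0.0]

answer = " ".join(best_span(tokens, start_scores, end_scores))
print(answer)
```

Taking the two maxima independently is the simplest possible strategy; it can produce an empty span when the best end comes before the best start.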

The raw output is messy and hard to read. That's why, for each task and each question, we display the context with the answer highlighted in a friendly way.
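
The highlighting idea can be sketched in a few lines: find the answer inside the context and wrap it in a `<span>` that a stylesheet can color. This is a simplified illustration rather than the exact display code of this notebook; `highlight` is a hypothetical helper, and escaping the text with `html.escape` is an extra precaution I added.

```python
import html
import re

def highlight(context, answer):
    """Wrap the first case-insensitive occurrence of `answer` in a <span>."""
    match = re.search(re.escape(answer), context, flags=re.IGNORECASE) if answer else None
    if match is None:
        return html.escape(context)  # nothing to highlight
    before = html.escape(context[:match.start()])
    found = html.escape(context[match.start():match.end()])
    after = html.escape(context[match.end():])
    return before + "<span class='answer'>" + found + "</span>" + after

snippet = highlight("Coronavirus is an infectious disease.", "infectious disease")
print(snippet)
```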

#### Summarization abstract for each question [Coming Soon]

The code for the summarization has been written but hasn't been tested and visualized yet.

#### Acknowledgement

This notebook has been inspired by the great work of:

- https://www.kaggle.com/dgunning/building-a-cord19-research-engine-with-bm25 by DwightGunning
- https://www.kaggle.com/dirktheeng/anserini-bert-squad-for-semantic-corpus-search by Dirk


### 2. Loading metadata dataframe

In [1]:
!pip install rank_bm25 -q

import numpy as np
import pandas as pd 
from pathlib import Path, PurePath

import nltk
from nltk.corpus import stopwords
import re
import string
import torch

from rank_bm25 import BM25Okapi

In [2]:
"""
Load metadata
"""

input_dir = PurePath('../input/CORD-19-research-challenge')
metadata_path = input_dir / 'metadata.csv'
metadata_df = pd.read_csv(metadata_path,
                               dtype={'Microsoft Academic Paper ID': str, 'pubmed_id': str})
metadata_df = metadata_df.dropna(subset=['abstract', 'title']).reset_index(drop=True)



### 3. Covid Search Engine

In [3]:
from rank_bm25 import BM25Okapi

# adapted from https://www.kaggle.com/dgunning/building-a-cord19-research-engine-with-bm25
english_stopwords = list(set(stopwords.words('english')))

class CovidSearchEngine:
    """
    Simple CovidSearchEngine.
    
    Usage:
    
    cse = CovidSearchEngine(metadata_df) # metadata_df is a pandas dataframe with 'title' and 'abstract' columns 
    search_results = cse.search("What is coronavirus", num=10) # Return `num` top-results
    """
    
    def remove_special_character(self, text):
        """
        Remove all special character from text string
        """
        return text.translate(str.maketrans('', '', string.punctuation))

    def tokenize(self, text):
        """
        Tokenize with NLTK

        Rules:
            - drop all words of 1 character
            - drop all stopwords
            - drop all numbers
            - keep each remaining word only once
        """
        words = nltk.word_tokenize(text)
        return list(set([word for word in words 
                         if len(word) > 1
                         and word not in english_stopwords
                         and not word.isnumeric() 
                        ])
                   )
    
    def preprocess(self, text):
        """
        Clean and tokenize text input
        """
        return self.tokenize(self.remove_special_character(text.lower()))


    def __init__(self, corpus: pd.DataFrame):
        self.corpus = corpus
        self.columns = corpus.columns
        raw_search_str = self.corpus.abstract.fillna('') + ' ' + self.corpus.title.fillna('')
        self.index = raw_search_str.apply(self.preprocess).to_frame()
        self.index.columns = ['terms']
        self.index.index = self.corpus.index
        self.bm25 = BM25Okapi(self.index.terms.tolist())
    
    def search(self, query, num):
        """
        Return top `num` results that better match the query
        """
        search_terms = self.preprocess(query) 
        doc_scores = self.bm25.get_scores(search_terms) # get scores
        
        ind = np.argsort(doc_scores)[::-1][:num] # indices of the `num` highest scores
        
        results = self.corpus.iloc[ind][self.columns].copy() # copy to avoid SettingWithCopyWarning
        results['score'] = doc_scores[ind] # Insert 'score' column
        results = results[results.score > 0] # drop papers that share no term with the query
        return results.reset_index()
    
cse = CovidSearchEngine(metadata_df) # Covid Search Engine

### 4. Code for the Question-Answering system

In [4]:
%%time

"""
LIBRARIES
"""

import torch
from transformers import BertTokenizer
from transformers import BertForQuestionAnswering


"""
SETTINGS
"""

NUM_CONTEXT_FOR_EACH_QUESTION = 10


"""
Transformers
"""

torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

print("Code running on: {}".format(torch_device) )

#PRETRAINED_DISTILBERT_PATH = "/kaggle/input/transformers-pretrained-distilbert/distilbert-base-uncased-distilled-squad/"
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

model = model.to(torch_device)
model.eval()

def answer_question(question, context):
    """
    Answer questions
    """
    encoded_dict = tokenizer.encode_plus(
                        question, context, # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 256,  # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt'     # Return pytorch tensors.
                   )
    
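    input_ids = encoded_dict['input_ids'].to(torch_device)
    token_type_ids = encoded_dict['token_type_ids'].to(torch_device) # segments
    attention_mask = encoded_dict['attention_mask'].to(torch_device) # 0 for padding tokens
    
    # Inference only: no gradients needed
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    # Index instead of unpacking, so this works whether the model returns a tuple or an output object
    start_scores, end_scores = outputs[0], outputs[1]

    all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    answer = tokenizer.convert_tokens_to_string(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])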
    
    answer = answer.replace('[CLS]', '')
    
    return answer



from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer_summarize = BartTokenizer.from_pretrained('bart-large-cnn')
model_summarize = BartForConditionalGeneration.from_pretrained('bart-large-cnn').to(torch_device)


model_summarize.to(torch_device)
# Set the model in evaluation mode to deactivate the DropOut modules
model_summarize.eval()

def get_summary(text):
    """
    Get summary
    """
    
    answers_input_ids = tokenizer_summarize.batch_encode_plus(
        [text], return_tensors='pt', max_length=1024
    )['input_ids']
    
    answers_input_ids = answers_input_ids.to(torch_device)
    
    summary_ids = model_summarize.generate(answers_input_ids,
                                           num_beams=4,
                                           max_length=5,
                                           early_stopping=True
                                          )
        
    return tokenizer_summarize.decode(summary_ids.squeeze(), skip_special_tokens=True, clean_up_tokenization_spaces=False)

    
"""
Main 
"""



def create_output_results(question, all_contexts, all_answers, summary_answer, summary_context):
    """
    Return a dictionary of the form
    
    {
        question: 'what is coronavirus',
        summary_answer: '',
        summary_context: '',
        results: [
            {
                'context': 'coronavirus is an infectious disease caused by',
                'answer': 'infectious disease',
                'start_index': 18,
                'end_index': 36
            },
            {
                ...
            }
        ]
    }
    
    Start and end index are useful to find the position of the answer in the context  
    """
    
    def find_start_end_index_substring(context, answer):   
        search_re = re.search(re.escape(answer.lower()), context.lower())
        if search_re:
            return search_re.start(), search_re.end()
        else:
            return 0, len(context)
        
    output = {}
    output['question'] = question
    output['summary_answer'] = summary_answer
    output['summary_context'] = summary_context
    results = []
    for c, a in zip(all_contexts, all_answers):

        span = {}
        span['context'] = c
        span['answer'] = a
        span['start_index'], span['end_index'] = find_start_end_index_substring(c,a)

        results.append(span)
    
    output['results'] = results
        
    return output


def get_all_context(query, num_results):
    """
    Search in the metadata dataframe and return the abstracts of the first `num_results` papers that best match the query
    """
    
    papers_df = cse.search(query, num_results)
    return papers_df['abstract'].str.replace("Abstract", "").tolist()


def get_all_answers(question, all_context):
    """
    Return a list of all answers, given a question and a list of context
    """    
    
    all_answers = []
    
    for context in all_context:
        all_answers.append(answer_question(question, context))
    return all_answers

    
def get_results(question, summarize=False, num_results=NUM_CONTEXT_FOR_EACH_QUESTION, verbose=True):
    """
    Return a dict object containing a list of all contexts and answers related to the (sub)question
    """
    
    if verbose:
        print("Getting context ...")
    all_contexts = get_all_context(question, num_results)
    
    if verbose:
        print("Answering the question for each context ...")
    all_answers = get_all_answers(question, all_contexts)
    
    summary_answer = ''
    summary_context = ''
    if verbose and summarize:
        print("Adding summary ...")
    if summarize:
        summary_answer = get_summary(' '.join(all_answers)) # get_summary expects a single string
        summary_context = get_summary(' '.join(all_contexts))
    
    if verbose:
        print("output.")
    
    return create_output_results(question, all_contexts, all_answers, summary_answer, summary_context)

Code running on: cuda



CPU times: user 1min 25s, sys: 12.8 s, total: 1min 38s
Wall time: 2min 1s


### 5. Dict object to store all Kaggle CORD-19 tasks

We store in a dict object all the CORD-19 tasks and the respective questions. In the next section, we will iterate over this dictionary and produce the different answers.

In [5]:
# adapted from https://www.kaggle.com/dirktheeng/anserini-bert-squad-for-semantic-corpus-search

covid_kaggle_questions = {
"data":[
          {
              "task": "What is known about transmission, incubation, and environmental stability?",
              "questions": [
                  "Is the virus transmitted by aerosol, droplets, food, close contact, fecal matter, or water?",
                  "How long is the incubation period for the virus?",
                  "Can the virus be transmitted asymptomatically or during the incubation period?",
                  "How does weather, heat, and humidity affect the transmission of 2019-nCoV?",
                  "How long can the 2019-nCoV virus remain viable on common surfaces?"
              ]
          },
          {
              "task": "What do we know about COVID-19 risk factors?",
              "questions": [
                  "What risk factors contribute to the severity of 2019-nCoV?",
                  "How does hypertension affect patients?",
                  "How does heart disease affect patients?",
                  "How does copd affect patients?",
                  "How does smoking affect patients?",
                  "How does pregnancy affect patients?",
                  "What is the fatality rate of 2019-nCoV?",
                  "What public health policies prevent or control the spread of 2019-nCoV?"
              ]
          },
          {
              "task": "What do we know about virus genetics, origin, and evolution?",
              "questions": [
                  "Can animals transmit 2019-nCoV?",
                  "What animal did 2019-nCoV come from?",
                  "What real-time genomic tracking tools exist?",
                  "What geographic variations are there in the genome of 2019-nCoV?",
                  "What efforts are being made in Asia to prevent further outbreaks?"
              ]
          },
          {
              "task": "What do we know about vaccines and therapeutics?",
              "questions": [
                  "What drugs or therapies are being investigated?",
                  "Are anti-inflammatory drugs recommended?"
              ]
          },
          {
              "task": "What do we know about non-pharmaceutical interventions?",
              "questions": [
                  "Which non-pharmaceutical interventions limit transmission?",
                  "What are most important barriers to compliance?"
              ]
          },
          {
              "task": "What has been published about medical care?",
              "questions": [
                  "How does extracorporeal membrane oxygenation affect 2019-nCoV patients?",
                  "What telemedicine and cybercare methods are most effective?",
                  "How is artificial intelligence being used in real time health delivery?",
                  "What adjunctive or supportive methods can help patients?"
              ]
          },
          {
              "task": "What do we know about diagnostics and surveillance?",
              "questions": [
                  "What diagnostic tests (tools) exist or are being developed to detect 2019-nCoV?"
              ]
          },
          {
              "task": "Other interesting questions",
              "questions": [
                  "What is the immune system response to 2019-nCoV?",
                  "Can personal protective equipment prevent the transmission of 2019-nCoV?",
                  "Can 2019-nCoV infect patients a second time?"
              ]
          }
   ]
}
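
Before running the full pipeline, it can be handy to sanity-check the structure of the dictionary. The sketch below does this on a small stand-in with the same shape as `covid_kaggle_questions`; the task titles and questions in it are placeholders, and `count_questions` is a hypothetical helper.

```python
# Stand-in with the same shape as `covid_kaggle_questions`; contents are made up.
sample = {
    "data": [
        {"task": "Task A", "questions": ["Question 1?", "Question 2?"]},
        {"task": "Task B", "questions": ["Question 3?"]},
    ]
}

def count_questions(tasks):
    """Map each task title to its number of questions."""
    return {t["task"]: len(t["questions"]) for t in tasks["data"]}

counts = count_questions(sample)
print(counts, sum(counts.values()))
```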

### 6. Answer to all questions

... and store them in the `all_answers` dictionary.

In [6]:
all_tasks = []


for i, t in enumerate(covid_kaggle_questions['data']):
    print("Answering question to task {}. ...".format(i+1))
    answers_to_question = []
    for q in t['questions']:
        answers_to_question.append(get_results(q, verbose=False))
    task = {}
    task['task'] = t['task']
    task['questions'] = answers_to_question
    
    all_tasks.append(task)

all_answers = {}
all_answers['data'] = all_tasks


Answering question to task 1. ...
Answering question to task 2. ...
Answering question to task 3. ...
Answering question to task 4. ...
Answering question to task 5. ...
Answering question to task 6. ...
Answering question to task 7. ...
Answering question to task 8. ...


### 7. Display questions, context and answers

With some HTML/CSS to make it more elegant and interpretable! ;)

In [7]:
from IPython.display import display, Markdown, Latex, HTML

def layout_style():
    
    
    style = """
        div {
            color: black;
        }
        
        .single_answer {
            border-left: 3px solid #dc7b15;
            padding-left: 10px;
            font-family: Arial;
            font-size: 16px;
            color: #777777;
            margin-left: 5px;

        }
        
        .answer{
            color: #dc7b15;
        }
        
        .question_title {
            color: grey;
            display: block;
            text-transform: none;
        }
               
        div.output_scroll { 
            height: auto; 
        }
    
    """
    
    return "<style>" + style + "</style>"

def dm(x): display(Markdown(x))
def dh(x): display(HTML(layout_style() + x))
    
def display_task(task):
    dm("## " + task['task']) # `dm` is the Markdown display helper defined above
    
#display_task(task1['data'][0])


def display_single_context(context, start_index, end_index):
    
    before_answer = context[:start_index]
    answer = context[start_index:end_index]
    after_answer = context[end_index:]

    content = before_answer + "<span class='answer'>" + answer + "</span>" + after_answer

    return dh("""<div class="single_answer">{}</div>""".format(content))

def display_question_title(question):
    # The caller already capitalizes the question; capitalizing the numbered title again would lowercase it
    return dh("<h2 class='question_title'>{}</h2>".format(question))

def answer_not_found(context, start_index, end_index):
    return (start_index == 0 and len(context) == end_index) or (start_index == 0 and end_index == 0)

def display_all_context(index, question):
    
    display_question_title(str(index + 1) + ". " + question['question'].capitalize())
    
    # display context
    for i in question['results']:
        if answer_not_found(i['context'], i['start_index'], i['end_index']):
            continue # skip not found questions
        display_single_context(i['context'], i['start_index'], i['end_index'])

def display_task_title(index, task):
    task_title = "Task " + str(index) + ": " + task
    return dh("<h1 class='task_title'>{}</h1>".format(task_title))

def display_single_task(index, task):
    
    display_task_title(index, task['task'])
    
    for i, question in enumerate(task['questions']):
        display_all_context(i, question)

task = 1
display_single_task(task, all_tasks[task-1])

In [8]:
task = 2
display_single_task(task, all_tasks[task-1])

In [9]:
task = 3
display_single_task(task, all_tasks[task-1])

In [10]:
task = 4
display_single_task(task, all_tasks[task-1])

In [11]:
task = 5
display_single_task(task, all_tasks[task-1])

In [12]:
task = 6
display_single_task(task, all_tasks[task-1])

In [13]:
task = 7
display_single_task(task, all_tasks[task-1])

In [14]:
task = 8
display_single_task(task, all_tasks[task-1])

### 8. Export solutions

Export the solutions JSON file for future analysis. It may be handy for you too!

In [15]:
import json
with open("covid_kaggle_answer_from_qa.json", "w") as f:
    json.dump(all_answers, f)
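
If you want to reuse the exported file, the sketch below shows one way to read such a JSON back and pull out (question, answer) pairs. To stay self-contained it writes a tiny stand-in with the same nesting as `all_answers` to a temporary folder; the file name `sample_answers.json` and the sample values are placeholders, not real results.

```python
import json
import os
import tempfile

# Tiny stand-in with the same nesting as `all_answers`; values are invented.
sample_answers = {"data": [{"task": "Task A", "questions": [
    {"question": "what is coronavirus", "summary_answer": "", "summary_context": "",
     "results": [{"context": "coronavirus is an infectious disease",
                  "answer": "infectious disease", "start_index": 18, "end_index": 36}]}
]}]}

path = os.path.join(tempfile.mkdtemp(), "sample_answers.json")
with open(path, "w") as f:
    json.dump(sample_answers, f)

with open(path) as f:
    loaded = json.load(f)

# Collect (question, answer) pairs from the nested structure
pairs = [(q["question"], r["answer"])
         for task in loaded["data"]
         for q in task["questions"]
         for r in q["results"]]
print(pairs)
```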

### 9. Conclusion 👍

We are at the end of our journey. I hope you enjoyed it and learned something new. If there is a part of the code you would like to understand better, just ask.

In the next versions, I will implement a more sophisticated search engine, try to improve the answers, finish the summarization, and provide a short comment on each task with its main findings/highlights. I would also like to add other information to the context, such as the link to the full research paper. Stay tuned.


Thank you for reading it all! .)
