# COVID Research Challenge - Non-Pharmaceutical Intervention

Created by a [TransUnion](https://www.transunion.com/) data scientist who believes that information can be used to **change our world for the better**. #InformationForGood

# Introduction

I used NLP and ML tools to develop better ways of finding relevant research. The competition proposes 10 tasks, each covering fundamental questions related to COVID-19. This notebook focuses on the task related to **non-pharmaceutical interventions**.

# Method

**Tools**

- EDA
    * Doc2Vec
    * WordCloud

- Model
    * BERT QA

## 1. Searching the COVID-19 Article Dataset

The dataset for this competition contains more than 100,000 articles saved in JSON format. To work with them efficiently, I organized the articles into a dataframe containing their `title`, `abstract`, and `body text`.
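As a rough illustration of that flattening step, the sketch below turns one parsed article into a flat row. The toy record only mimics the fields this notebook reads (`paper_id`, `metadata.title`, `abstract`, `body_text`), and `flatten_article` is a hypothetical helper name, not part of the dataset tooling.

```python
import json

# Toy record using only the fields this notebook reads; real files carry more.
raw = json.dumps({
    "paper_id": "abc123",
    "metadata": {"title": "Masks and transmission"},
    "abstract": [{"text": "We study masks."}],
    "body_text": [{"text": "Intro paragraph."}, {"text": "Methods paragraph."}],
})


def flatten_article(record):
    """Collapse one parsed article into a flat row (hypothetical helper)."""
    def join(key):
        # Sections are lists of {"text": ...}; a missing key yields "".
        return "\n\n".join(part["text"] for part in record.get(key, []))

    return {
        "paper_id": record["paper_id"],
        "title": record["metadata"]["title"],
        "abstract": join("abstract"),
        "body": join("body_text"),
    }


row = flatten_article(json.loads(raw))
print(row["title"], "|", len(row["body"]), "characters of body text")
```

A list of such rows can be passed straight to `pd.DataFrame`.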

To find articles related to non-pharmaceutical interventions, I look for the nearest neighbors of the task description in document-vector space. To turn text into a numeric representation that can be compared by distance, I used Doc2Vec to embed each article as a vector.
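The nearest-neighbor idea can be sketched in plain Python on made-up three-dimensional vectors. This is only an illustration under the assumption of Euclidean distance; the toy vectors and the `nearest_neighbors` helper are hypothetical, and the notebook itself relies on Doc2Vec vectors and scikit-learn for the real search.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to query."""
    ranked = sorted(enumerate(vectors), key=lambda pair: euclidean(query, pair[1]))
    return [(i, euclidean(query, v)) for i, v in ranked[:k]]


# Made-up "document vectors"; real ones would come from the Doc2Vec model.
doc_vectors = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.8, 0.2, 0.1]]
task_vector = [1.0, 0.0, 0.0]

neighbors = nearest_neighbors(task_vector, doc_vectors, k=2)
print(neighbors)  # documents 0 and 2 are closest to the task vector
```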

## 2. Answering Non-Pharmaceutical Intervention Question From Data

BERT is a contextual word representation model learned from large-scale language model pretraining of a bidirectional Transformer (Vaswani et al. 2017). Recent work has shown major improvements in a wide variety of tasks using BERT or similar Transformer models. The query taken as input is a general question for the NPI task, and we look for answer spans in the most similar articles.
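For question answering, the question and the article text are fed to BERT as one sequence of the form `[CLS] question [SEP] context [SEP]`, with segment ids marking which part each token belongs to. The sketch below shows that layout on made-up token ids. `pack_pair` is a hypothetical helper, and the small `max_len` is chosen only to keep the example readable (the model used later accepts up to 512 tokens).

```python
CLS, SEP = 101, 102  # special-token ids in the BERT uncased vocabulary


def pack_pair(question_ids, context_ids, max_len=512):
    """Build input ids and segment ids, truncating the context tail if needed."""
    budget = max_len - len(question_ids) - 3  # room left after [CLS] and two [SEP]
    context_ids = context_ids[:max(budget, 0)]
    ids = [CLS] + question_ids + [SEP] + context_ids + [SEP]
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 covers the rest.
    segments = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    return ids, segments


# Made-up ids: a 3-token question and a 600-token context.
ids, segments = pack_pair([7, 8, 9], list(range(1000, 1600)), max_len=16)
print(len(ids), segments)
```

Truncating the tail is only one option; for long bodies, a window from the middle of the context may be preferable.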


In [None]:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # a bare Python assignment would not set the environment variable

In [None]:
import spacy
import os
import pandas as pd
import numpy as np
import nltk
import re
import torch
import json
import string
import sys
import random
import time

Check out metadata info

In [None]:
# read meta data
import pandas as pd
df = pd.read_csv('../input/CORD-19-research-challenge/metadata.csv')
df.shape

Organizing all articles

In [None]:
from tqdm import tqdm
path = "../input/CORD-19-research-challenge/document_parses"
# sub-folder
subdir = ["pdf_json", "pmc_json"]
article = []

for d in subdir:
    for f in tqdm(os.listdir(f"{path}/{d}")):
        with open(f"{path}/{d}/{f}", "rb") as fh:  # close each file after reading
            json_file = json.load(fh)
        title = json_file["metadata"]["title"]
        if d == "pdf_json":
            abstract = "\n\n".join([t["text"] for t in json_file["abstract"]])
        else:
            abstract = "\n"
        body = "\n\n".join([t["text"] for t in json_file["body_text"]])
        paper_id = json_file["paper_id"]
        article.append([paper_id, title, abstract, body])

In [None]:
import gc
article_df = pd.DataFrame(article, columns = ["paper_id", "title", "abstract", "body"])
del article
gc.collect()
article_df.head()

Clean and Tokenize text

In [None]:
!pip install spacy_langdetect

In [None]:
import string
from nltk.stem import PorterStemmer, WordNetLemmatizer
from spacy_langdetect import LanguageDetector
from sklearn.base import BaseEstimator, TransformerMixin
import gensim
from langdetect import detect
from nltk.corpus import stopwords
from pprint import pprint
from gensim.models.doc2vec import Doc2Vec

nltk.download('stopwords')
nltk.download('wordnet')
nlp = spacy.load('en')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

class PreProcess(BaseEstimator, TransformerMixin):
    def tokenizer(self, input_text):
        return re.split(r'\W+', input_text)

    def remove_urls(self, input_text):
        return re.sub(r'http.?://[^\s]+[\s]?', '', input_text)
    
    def remove_punctuation(self, input_text):
        trantab = str.maketrans('', '', string.punctuation)
        return input_text.translate(trantab)
    
    def remove_digits(self, input_text):
        return re.sub(r'\d+', '', input_text)
    
    def to_lower(self, input_text):
        return input_text.lower()
    
    def remove_stopwords(self, words):
        stopwords_list = stopwords.words('english')
        whitelist=[]
        clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1] 
        return clean_words
    
    def stemming(self, words):
        porter = PorterStemmer()
        stemmed_words = [porter.stem(word) for word in words]
        return " ".join(stemmed_words)

    def lemma(self,words):
        lemmatizer = WordNetLemmatizer()
        stemmed_words = [lemmatizer.lemmatize(word) for word in words]
        return stemmed_words

    def english_only(self, words):
        english_words = []
        for word in words:
            if detect(word) == 'en':
                english_words.append(word)
        return english_words
        

In [None]:
def clean(text):
    pp = PreProcess() 
    text = str(text)
    clean = pp.remove_urls(text)
    clean = pp.remove_punctuation(clean)
    clean = pp.remove_digits(clean)
    clean = pp.to_lower(clean)
    clean = pp.tokenizer(clean)
    clean = pp.remove_stopwords(clean)
    clean = pp.lemma(clean)
    return clean

In [None]:
article_df.shape

In [None]:
# title
title_tokenized = [clean(t) for t in article_df['title'].values]
# body
body_tokenized = [clean(b) for b in article_df['body'].values]
# abstract
abstract_tokenized = [clean(a) for a in article_df['abstract'].values]

In [None]:
#Clean title, text and abstract 
article_df['title_tokenized'] = title_tokenized
article_df['body_tokenized'] = body_tokenized
article_df['abstract_tokenized'] = abstract_tokenized

In [None]:
#Combine title, text, and abstract
article_df['complete_text_tokenized'] = article_df['title_tokenized'] + article_df['body_tokenized'] + article_df['abstract_tokenized']
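# Keep articles with more than 100 tokens; copy so later column assignments do not touch article_df
selected_article = article_df[article_df['complete_text_tokenized'].map(len) > 100].copy()

#Describing our final dataframe.
selected_article.describe()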

In [None]:
selected_article.shape

Train a doc2vec model
 

In [None]:
def read_corpus(df, column):
    for i, line in enumerate(df[column]):
        yield gensim.models.doc2vec.TaggedDocument(line, [i])

train_df  = selected_article.sample(frac=1, random_state=42)

#train corpus
train_corpus = (list(read_corpus(train_df, 'complete_text_tokenized'))) 

In [None]:
# Doc2VEC : using distributed memory model
model = gensim.models.doc2vec.Doc2Vec(dm=1, vector_size=300, min_count=10, epochs=20, seed=42, workers=10)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Before applying Doc2Vec to the task description, use RAKE to extract its keywords.

`RAKE`, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm that determines key phrases in a body of text by analyzing how frequently each word appears and how often it co-occurs with other words in the text.
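To make the scoring idea concrete, here is a simplified, self-contained sketch of RAKE-style ranking. It is not the `rake_nltk` implementation: the tiny stopword set and the `rake_sketch` helper are assumptions made for illustration. Candidate phrases are split at stopwords and punctuation, each word is scored as degree divided by frequency, and a phrase's score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real runs use a full list).
STOPWORDS = {"a", "about", "and", "do", "for", "in", "is", "of", "on", "the", "to", "we", "what"}


def rake_sketch(text):
    """Rank candidate phrases by the sum of word degree/frequency scores."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):  # break at punctuation
        current = []
        for word in fragment.split():
            if word in STOPWORDS:  # stopwords delimit candidate phrases
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: -kv[1])


ranked = rake_sketch("What do we know about the effectiveness of school closures and travel bans?")
print(ranked[:2])  # multi-word phrases outrank single words
```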

Turn task detail into a vector

In [None]:
task5 = """What do we know about the effectiveness of non-pharmaceutical interventions? What is known about equity and barriers to compliance for non-pharmaceutical interventions? Guidance on ways to scale up NPIs in a more coordinated way (e.g., establish funding, infrastructure and authorities to support real time, authoritative (qualified participants) collaboration with all states to gain consensus on consistent guidance and to mobilize resources to geographic areas where critical shortfalls are identified) to give us time to enhance our health care delivery system capacity to respond to an increase in cases. Guidance on ways to scale up NPIs in a more coordinated way (e.g., establish funding, infrastructure and authorities to support real time, authoritative (qualified participants) collaboration with all states to gain consensus on consistent guidance and to mobilize resources to geographic areas where critical shortfalls are identified) to give us time to enhance our health care delivery system capacity to respond to an increase in cases. Rapid design and execution of experiments to examine and compare NPIs currently being implemented. DHS Centers for Excellence could potentially be leveraged to conduct these experiments. Rapid assessment of the likely efficacy of school closures, travel bans, bans on mass gatherings of various sizes, and other social distancing approaches. Methods to control the spread in communities, barriers to compliance and how these vary among different populations. Models of potential interventions to predict costs and benefits that take account of such factors as race, income, disability, age, geographic location, immigration status, housing status, employment status, and health insurance status. Policy changes necessary to enable the compliance of individuals with limited resources and the underserved with NPIs. 
Research on why people fail to comply with public health advice, even if they want to do so (e.g., social or financial costs may be too high). Research on the economic impact of this or any pandemic. This would include identifying policy and programmatic alternatives that lessen/mitigate risks to critical government services, food distribution and supplies, access to critical household supplies, and access to health diagnoses, treatment, and needed care, regardless of ability to pay."""

In [None]:
!pip install rake_nltk

In [None]:
from rake_nltk import Rake
r = Rake() 
r.extract_keywords_from_text(task5)

task = ' '.join(r.get_ranked_phrases())

In [None]:
task

In [None]:
def get_doc_vector(doc):
    tokens = clean(doc) 
    vector = model.infer_vector(tokens)
    return vector

task_array = [get_doc_vector(task)]

In [None]:
from sklearn.neighbors import NearestNeighbors
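# Doc2Vec tags follow the shuffled train_df order, so align the vectors back by index
selected_article['complete_text_vector'] = pd.Series(list(model.docvecs.vectors_docs), index=train_df.index)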
text_array = selected_article['complete_text_vector'].values.tolist()

#Apply KNN to extract 80 neighbors
ball_tree = NearestNeighbors(algorithm='ball_tree', leaf_size=20).fit(text_array)
distances, indices = ball_tree.kneighbors(task_array, n_neighbors=80)

df_output = pd.DataFrame(columns=['Task','Result_Paper_ID','complete_text_tokenized'])

In [None]:
article_df.to_csv("./article_df.csv", sep="," , encoding='utf-8')
selected_article.to_csv("./selected_article.csv", sep=",", encoding='utf-8')

In [None]:
del selected_article
gc.collect()

In [None]:
!pwd

In [None]:
# KNN indices are positions within the filtered articles (more than 100 tokens),
# so rebuild that same subset instead of indexing the full article_df.
candidates = article_df[article_df['complete_text_tokenized'].map(len) > 100]
rows = []
for i, info in enumerate([task]):
    neighbors = candidates.iloc[indices[i]]
    for paper_id, tokens in zip(neighbors['paper_id'], neighbors['complete_text_tokenized']):
        rows.append({'Task': i, 'Result_Paper_ID': paper_id, 'complete_text_tokenized': tokens})
df_output = pd.DataFrame(rows, columns=['Task', 'Result_Paper_ID', 'complete_text_tokenized'])
df_output.to_csv('df_output.csv', sep=',', encoding='utf-8')
df_output.shape

Visualize key words in articles related to task5

In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from ast import literal_eval
# Use a distinct name so the NLTK `stopwords` corpus imported earlier is not shadowed
wc_stopwords = set(STOPWORDS)
new_stopwords = ['copyright', "manuscript","funders",'pmc', "et", "europe","al", 'license', 'display', 'author', 'preprint', 'patient', 'authorfunder','ef','using', 'new', 'set', 'yet', 'fully', 'expected', 'medrxiv', 'available', 'granted','furthermore']
new_stopwords_list = wc_stopwords.union(new_stopwords)

def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='lightblue',
        stopwords=new_stopwords_list,
        max_words=500,
        max_font_size=40, 
        scale=5,
        random_state=2020
    ).generate(str(data))

    fig = plt.figure(1, figsize=(15,15))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=14)
        fig.subplots_adjust(top=2)
  
    plt.imshow(wordcloud)
    plt.show()

df_output = pd.read_csv('df_output.csv')
npi = df_output['complete_text_tokenized']

In [None]:
%matplotlib inline 
lem = WordNetLemmatizer()

words = []
for i in npi : 
    keywords= literal_eval(i)
    for j in keywords:
        words.append(lem.lemmatize(j))

words = ' '.join(words)
show_wordcloud(words, title = 'Task : What do we know about non-pharmaceutical interventions?')

In [None]:
del df_output
gc.collect()

# Q&A Bert Model

## Inference

In [None]:
torch.cuda.is_available()

In [None]:
from transformers import BertTokenizer, BertForQuestionAnswering

device =  "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
QA = BertForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad").to(device)

**Start & End Token Classifiers**

`BERT` needs to highlight a `span` of text containing the answer. This is represented as predicting which token marks the start of the answer and which token marks the end.
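A minimal sketch of that span selection, using made-up scores rather than real model output: pick the highest-scoring start token and the highest-scoring end token, and reject the pair if the end comes before the start. The `best_span` helper and the toy numbers are hypothetical. Using the smaller of the two scores as a confidence value mirrors how results are ranked later in this notebook.

```python
def best_span(start_scores, end_scores):
    """Pick the top start and end tokens; return None if they do not form a span."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    if end < start:
        return None  # end predicted before start: no usable answer
    confidence = min(start_scores[start], end_scores[end])
    return start, end + 1, confidence  # end is exclusive, ready for slicing


tokens = ["[CLS]", "what", "works", "[SEP]", "school", "closures", "reduce", "spread", "[SEP]"]
start_scores = [0.1, 0.0, 0.2, 0.0, 5.1, 1.0, 0.3, 0.2, 0.0]  # made-up values
end_scores = [0.0, 0.1, 0.0, 0.0, 0.4, 4.7, 0.9, 1.2, 0.0]    # made-up values

span = best_span(start_scores, end_scores)
print(tokens[span[0]:span[1]], span[2])
```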

In [None]:
from IPython.display import display, HTML
def convert_to_str(token_id):
    """
    Convert token id to str
    """
    tokens = tokenizer.convert_ids_to_tokens(token_id)
    return tokenizer.convert_tokens_to_string(tokens)
    

def QA_inference(query, search_on, df):
    """
    Inference processor for Bert model
    ========
    query:       question input [str]
    search_on:   title, abstract, body
    df:          data
    """
    # Init 
    token_id, score, span = [], [], []
    
    for i in tqdm(range(len(df))):
        ids = tokenizer.encode(query, df[search_on][i])
        q_len = ids.index(102) + 1  # [CLS] + question + first [SEP]
        
        if len(ids) > 512:
            budget = 511 - q_len  # context tokens that fit before the final [SEP]
            context = ids[q_len:-1]
            if search_on == "title" or search_on == "abstract":
                h = 0  # keep the beginning of short fields
            else:
                h = (len(context) - budget)//2  # keep the middle of the body
            # Always keep the question in front of the truncated context
            ids = ids[:q_len] + context[h:h+budget] + [102]
        token_type_id = [0]*q_len + [1]*(len(ids) - q_len)

        # Tensors
        ids_tensor = torch.tensor([ids]).to(device)
        token_type_id_tensor = torch.tensor([token_type_id]).to(device)
        
        # Inferencing without gradients; indexing works for tuple and ModelOutput returns
        with torch.no_grad():
            outputs = QA(ids_tensor, token_type_ids=token_type_id_tensor)
        start_scores, end_scores = outputs[0], outputs[1]
        
        # Move results to CPU to release GPU memory
        start_scores, end_scores = start_scores.cpu().numpy(), end_scores.cpu().numpy()
        
        span.append([start_scores.argmax(), end_scores.argmax()+1])
        score.append([start_scores.max(), end_scores.max()])
        token_id.append(ids)
    
    span, score = np.array(span), np.array(score)
    return span, score, token_id


In [None]:
def display_res(spans, scores, search_on, token_ids, data, top_n=10):
    min_scores = scores.min(axis=1) 
    sorted_idx = (-min_scores).argsort() # Descending order
    
    counter = 0    
    for idx in sorted_idx:
        if counter >= top_n:
            break
        if spans[idx,0] == 0 or spans[idx,1] == 0 or \
            spans[idx,1]<=spans[idx,0]:
            continue
        start, end = spans[idx, :]

        text = data[search_on][idx]
        highlight = convert_to_str(token_ids[idx][start:end])
        
        start = text.lower().find(highlight)
        if start == -1:
            text = convert_to_str(token_ids[idx]
                                      [token_ids[idx].index(102)+1:])
            start = text.find(highlight)
            end = start + len(highlight)
            text = text[:-5]
        else:
            end = start + len(highlight)
            highlight = text[start:end]
        before, after = text[: start], text[end : ]
    
        # Putting information in HTML format
        html_ = f"<text style=color:red><b>Answer: {highlight}</b></text><br><br>" + \
                f"<b>({counter+1}) {data['title'][idx]} </b><br>" + \
                f"Score: {scores[idx].min()} <br>" + \
                "<p style=line-height:1.5><font size=4>" + \
                before + \
                f"<text style=color:red>{highlight}</text>" + \
                after + \
                "</font></p>"
        
        display(HTML(html_))
        
        counter += 1

def final(question, search_on, df, top_n=10):
    
    spans, scores, token_ids = QA_inference(question, search_on, df)
    display_res(spans, scores, search_on, token_ids, df, top_n)
    

## Result

In [None]:
Question = "What do we know about non-pharmaceutical interventions?"

In [None]:
final(Question, "abstract", article_df)

![Example1](https://www.kaggle.com/trexwithoutt/plot-anwser/1.png)
    