![T10_Banner.png](attachment:T10_Banner.png)

# COVID-19 U.S. White House Challenge


# 1. Introduction

At **Ericsson**, we are part of an extraordinary family. Not only do we have some of the best and brightest tech minds in the industry, but our teams are also committed to giving back and dedicated to making the world a better place. Therefore, in response to the call to action issued by the White House to the tech community in March 2020 to confront the COVID-19 pandemic, **Ericsson** has decided to partner with the National Institutes of Health, Georgetown University and the White House Office of Science and Technology Policy on their open-research dataset challenge.

The U.S. White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection of over 47,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. However, the rapid increase in the volume and type of coronavirus literature makes it difficult for the medical community to keep up. Using data mining tools, we can help the medical community in developing answers to high priority scientific questions related to vaccines and therapeutics.

## 1.1 Ericsson Team-10
#### Our team was in charge of researching the following COVID-19-related questions:
What do we know about vaccines and therapeutics? 

What has been published concerning research and development and evaluation efforts of vaccines and therapeutics?

## 1.2 Our Ericsson Family Contributors

The following **Team-10** members from the Ericsson family contributed to the development of this COVID-19 Challenge Jupyter Notebook:  

    Bishoy Alphonse      Karanj Rupareliya      Oana-Gabriela Borcoci      Yamini Saragadam
    Danlin Jin           Kiran Krishna Guda     Pramit Mehrotra            Yue Xin
    Disha Goel           Madhava Bhamy          Robin Von	 
    Jieneng Yang	     Nanda Taliyakula       Rohit Rajput	 

Special recognition and appreciation to the following **Team-10** Ericsson Family members whose diligence, self-motivation, and dedication to always go the extra mile to achieve the best possible results for this COVID-19 Challenge is truly remarkable:

    Amanda Perez         Diego Martos           Jim Reno                    Ricardo Omana
    Debasis Maity        Dimple Thomas          Jing Hu                     Serveh Shalmashi
    Deidre Marshall      Emmet Moore            Melissa Hatfield            Sneha Wadhwa
    Denis Shleifman      Forough Yaghoubi       Mukunda Prasad Jena         Venkata Snehith Reddy
    Derrick Hernandez    Hernan Peniche         Omar Nushaiwat              Wilfredo Velez

# 2. Notebook’s Goal 

The goal of this notebook is to help the medical community find answers to the questions below as they relate to COVID-19. Using data science and data mining techniques and tools, this notebook intends to help the medical community quickly find the most relevant scholarly articles that could help answer these questions.
* Effectiveness of drugs being developed and tried to treat COVID-19 patients.
* Methods evaluating potential complications of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
* Exploration of the use of best animal models and their predictive value for a human vaccine.
* Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
* Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.
* Efforts targeted at a universal coronavirus vaccine.
* Efforts to develop animal models and standardize challenge studies.
* Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers.
* Approaches to evaluate risk for enhanced disease after vaccination
* Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models (in conjunction with therapeutics)

![T10_Methodology.png](attachment:T10_Methodology.png)

# 3. Methodology 

The workflow in the figure above illustrates the approach taken to pre-process, model, and post-process the CORD-19 dataset.


The first step is pre-processing. It involves basic data cleansing across the dataset, including selecting papers by language, checking the availability of the full research document, and other basic elements.
At the core of the modeling step is a Natural Language Processing (NLP) library (1). It makes it possible to execute different tasks such as clustering, summarization, phrase matching, and ranking of the given research papers.


Finally, the most relevant documents are selected by representing words and sentences in vector form. This representation provides a way to measure semantic similarity to a given search sub-task.

## 3.1 Pre-processing
### 3.1.1 Dataset description
Each paper in the dataset is represented by a JSON file, and is located in one of four subdirectories under the "/kaggle/input/CORD-19-research-challenge" directory, depending on how the article is licensed:

* "comm_use_subset"
* "noncomm_use_subset"
* "custom_license"
* "biorxiv_medrxiv"

For each paper, we want to extract the following data:

* paper ID
* publication date
* title
* abstract text
* body text
* primary location and country for the author(s)

### 3.1.2 Data cleaning 
The following steps were considered for dataset cleaning:

* Remove unnecessary or unhelpful characters and words from the paper text
* Remove duplicate papers
* Remove papers which are not in English
* Eliminate null values 
* Remove blank space
* Remove references and annotations

### 3.1.3 Entity creation  

A set of keyword lists was prepared to reduce the dataset that will be used to identify the relevant articles. 

The following table shows the Keyword lists used with the number of keywords under each list:

![KeywordLists.png](attachment:KeywordLists.png)
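
As a rough illustration of how keyword lists can narrow a dataset, the sketch below keeps only the papers whose text contains at least one keyword from some list. The list names, keywords, and sample texts are hypothetical stand-ins, not the lists from the table above, and `matching_lists` is a helper introduced only for this example.

```python
import re

# Hypothetical keyword lists (the notebook's real lists are summarized in the table above)
keyword_lists = {
    "vaccine": ["vaccine", "immunization"],
    "therapeutic": ["antiviral", "remdesivir"],
}

def matching_lists(text, keyword_lists):
    """Return the names of the keyword lists with at least one whole-word hit in text."""
    lowered = text.lower()
    hits = []
    for name, words in keyword_lists.items():
        # Build one alternation pattern per list; re.escape guards special characters
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
        if re.search(pattern, lowered):
            hits.append(name)
    return hits

# Made-up paper texts keyed by a made-up paper ID
papers = {
    "p1": "Remdesivir shows antiviral activity in vitro.",
    "p2": "Airline traffic patterns during 2019.",
    "p3": "A candidate vaccine elicits neutralizing antibodies.",
}

# Keep only the papers that matched at least one keyword list
relevant = {pid: matching_lists(text, keyword_lists) for pid, text in papers.items()}
relevant = {pid: names for pid, names in relevant.items() if names}
print(relevant)
```

Matching on whole words avoids hits inside longer, unrelated words, at the cost of missing plural or inflected forms unless they are added to the lists.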


## 3.2 Modeling


For the articles relevant to each question, we use spaCy similarity to find words that are semantically close. This is done by measuring the similarity between word vectors in the vector space. spaCy is a widely used natural language processing library and one of the fastest.
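
To make the idea of similarity between word vectors concrete, here is a minimal sketch of cosine similarity, the measure spaCy uses by default when comparing vectors. The three-dimensional vectors are invented for illustration; real word vectors come from the language model and have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy "word vectors" (made up for this example)
toy_vectors = {
    "vaccine": [0.9, 0.1, 0.0],
    "immunization": [0.8, 0.2, 0.1],
    "airport": [0.0, 0.1, 0.9],
}

# Rank every word by its similarity to the query word
query = toy_vectors["vaccine"]
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine_similarity(query, toy_vectors[word]),
    reverse=True,
)
print(ranked)
```

Words whose vectors point in nearly the same direction score close to 1, while unrelated words score close to 0.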


spaCy phrase matching and named entity recognition (NER) search a text for entities in specified categories. The phrase matcher finds phrases that match the entities; it not only lets you find words and phrases, it also returns the corresponding document. 


## 3.3 Post-processing


In order to extract the relevant paragraph or sentence, we are using BERT word embeddings and NLTK clustering. The clustering starts with an arbitrary k and iterates until each vector is allocated to its closest cluster. Gensim includes streamed parallelized implementations of fastText, non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), word2vec and doc2vec algorithms. 
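
The clustering idea can be sketched with a plain-Python k-means. This is a stand-in written for illustration, not the NLTK `KMeansClusterer` that the notebook imports, and the two-dimensional "sentence vectors" are made up; real embeddings would have far more dimensions.

```python
def kmeans(points, k, iterations=10):
    """Cluster points (lists of floats) into k groups with basic k-means."""
    # Naive initialization: use the first k points as starting centroids
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for idx, point in enumerate(points):
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            labels[idx] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members
        for c_idx in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c_idx]
            if members:
                centroids[c_idx] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Made-up 2-D vectors forming two obvious groups
sentence_vectors = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels, centroids = kmeans(sentence_vectors, k=2)
print(labels)
```

Sentences that land in the same cluster can then be treated as candidates for the same topic, and the sentence closest to each centroid is a natural representative.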




## 3.4 Pros

- Easy: the same program can be used for all subtasks. 
- Accuracy: creating specific keywords helps to nail down pertinent papers.  
- Speed: independent researchers showed that spaCy offered the fastest syntactic parser in the world and that its accuracy was within 1% of the best available. 

## 3.5 Cons 

- Text summarization helps to locate the exact paragraph, but the results show that additional improvement is possible. 

#  4. The Code

The major preprocessing we have performed includes:

1. Preparation
2. Acquiring and preprocessing the dataset
3. Cleaning the dataset 

## 4.1 Preparation

For the preparation task, required software packages are imported and installed, and helper functions are defined to make the code more readable and efficient.

### 4.1.1 Installing and Importing the Necessary Python Packages

The following code blocks install the necessary Python packages on the system, and import them into the Python environment.

In [None]:
!pip install cord-19-tools
!pip install spacy-langdetect
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz

In [None]:
import cotools as co
import gc

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from IPython.core.display import display, HTML
from collections import defaultdict
import functools
import spacy
from spacy.matcher import PhraseMatcher
from spacy_langdetect import LanguageDetector
import en_core_sci_lg
import os
import re
import sys
import glob
  
from sklearn import cluster
from sklearn import metrics
from sklearn.manifold import TSNE

#from bert_serving.client import BertClient  # if using bert

from gensim.models import Word2Vec
from gensim.summarization.summarizer import summarize 
from gensim.summarization import keywords 

from tqdm.notebook import tqdm

from nltk.corpus import stopwords
from string import punctuation
from nltk.stem.lancaster import LancasterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.probability import FreqDist
from nltk.cluster import KMeansClusterer
import nltk
from heapq import nlargest

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [None]:
pd.options.mode.chained_assignment = None

### 4.1.2 Defining Useful Helper Functions

The following two helper functions are used as part of the code.

```log_progress()``` uses the *ipywidgets* Python package to create and display a progress bar, which gives the user running the notebook an indication of how far code execution has progressed in the portions of the notebook where this function is called.


```process_text()``` uses the *spaCy* package natural language processing capabilities to clean and lemmatize an input text string.  The function creates a spaCy "Doc" object from the input string, which is a list of tokens that break down the string into its constituent parts; e.g., individual words, spaces, punctuation marks, and lemmas corresponding to each word. It then runs through the tokens, screens out punctuation, stop words, and pronouns, and returns a list of the lemmas corresponding to each remaining word.

In [None]:
# Use the ipywidgets package to display a progress bar to give the user an indication
# of the progress of code execution for certain portions of the notebook.
# This function is intended to be used in the definition of a for loop to indicate
# how far execution has gotten through the object being iterated over in the for loop.
#
# Inputs: (sequence) - contains the for loop iteration (e.g., list or iterator)
#         (every)    - number of steps to display
#
# Outputs: displays the progress bar in the notebook
#          (yield record) - returns the current iteration object back to the calling for loop
#
def log_progress(sequence, every=None, size=None, name='Items'):
    
    '''
    Tracks the progress of for loop iteration
    
    Inputs: (sequence) - contains the for loop iteration
            (every) - the number of steps to display
    Outputs: (display) - a tracking bar that shows the progress of for loop iteration
    '''
    
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display

    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)     # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'
    
    # Instantiate and display the progress bar        
    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)
    
    # Update the progress bar state at each iteration of the for loop using this function
    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
                    
            # return the current iteration object, preserving the state of the function
            yield record
    except:
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )
        
# Use the spaCy library natural language processing capabilities to clean an input text, 
# in string format, for punctuation, stop words, and lemmatization.
#
# Inputs (text) - a string to clean and lemmatize
#
# Outputs - a modified version of the input string that has been cleaned by removing punctuation, 
#           stop words, and pronouns, and has had the remaining words converted into corresponding lemmas
#         
def process_text(text):
    
    '''
    Cleans an input text in string format for punctuation, stopwords and lemmatization
    
    Inputs: (string) - input text
    Outputs: (string) - a cleaned output that removes punctuation, stopwords and converts words to lemma
    '''
    
    # Create a spaCy "Doc" object from the input text string.
    doc = nlp(text.lower())
    result = [] # list that will contain the lemmas for each word in the input string
    
    for token in doc:
        
        if token.text in nlp.Defaults.stop_words: #screen out stop words
            continue
        if token.is_punct:                        #screen out punctuations
            continue
        if token.lemma_ == '-PRON-':              #screen out pronouns
            continue
        
        result.append(token.lemma_)
    
    # Return the lemmatized version of the cleaned input text string
    return " ".join(result)

## 4.2 Acquiring and Preprocessing the Dataset

This step explains how to download the CORD-19 dataset and how to extract and format the data.

### 4.2.1 Downloading the CORD-19 Dataset from Kaggle

**ATTENTION**
The need to download the CORD-19 dataset from Kaggle is applicable only if this notebook is downloaded and run outside of Kaggle. If this is the case, follow the instructions on the Kaggle site. 

If you want to write code to download the dataset, the following steps can be used as guidance:
1. Import the Kaggle API
2. Set the local directory where you want to download the dataset
3. Set the Kaggle account credentials, and establish an authenticated Kaggle API instance.
4. Download the CORD-19 Research Challenge dataset using the Kaggle API method
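
The steps above can be sketched as follows. This is a minimal sketch that assumes the `kaggle` package is installed and that credentials are available in `~/.kaggle/kaggle.json` or in the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The dataset slug and the `./cord19` target directory are assumptions; check the slug against the dataset page before use.

```python
# Assumed dataset identifier (owner/dataset-name); verify on the Kaggle dataset page
DATASET_SLUG = "allen-institute-for-ai/CORD-19-research-challenge"

def download_cord19(target_dir="./cord19"):
    """Download and unzip the dataset into target_dir using the Kaggle API."""
    # Step 1: import the Kaggle API. The import is done inside the function
    # because importing the kaggle package already looks for credentials.
    from kaggle.api.kaggle_api_extended import KaggleApi

    # Step 3: establish an authenticated API instance
    api = KaggleApi()
    api.authenticate()

    # Steps 2 and 4: download into the chosen local directory and unzip
    api.dataset_download_files(DATASET_SLUG, path=target_dir, unzip=True)
    return target_dir
```

Calling `download_cord19()` once is enough; afterwards `data_dir` in the cells below would need to point at the chosen local directory instead of the Kaggle input path.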

### 4.2.2 Extracting and Formatting the Data for all Papers in the Dataset

We can now load, extract, and reformat the data for the papers into a format suitable for further analysis. The raw data for the CORD-19 Research Challenge is accessible on the Kaggle site
at the following path:  "/kaggle/input/CORD-19-research-challenge". 

In addition, for storing the formatted data, we will use "/kaggle/working/".

#### 4.2.2.1 Loading the Metadata

The *metadata.csv* file, located in the **Input** directory, contains useful summary information about each article in the dataset. The information provided for each paper includes (but is not limited to) the following:

* title
* license under which the paper is published
* abstract text
* publication date
* authors
* publishing journal
* whether the data includes the full text of the paper

We use the metadata to extract some additional information not found in the JSON files for each paper. The following code reads in the *metadata.csv* file and converts it into a *pandas* DataFrame for further processing.

In [None]:
# Set the path where the raw data is
data_dir = '/kaggle/input/CORD-19-research-challenge'
# Set the current working directory path to where the raw data is
os.chdir(data_dir)

# Set the path where the formatted data will be stored
output_dir = '/kaggle/working/'

# Read in the metadata.csv file as a pandas DataFrame
metadata_information = pd.read_csv('metadata.csv')

#### 4.2.2.2 Creating the Preprocessed Dataset 

From the output of the ```metadata_information.shape``` attribute shown below, we see that the metadata file has information on 51,078 papers, but the raw data in the subdirectories includes JSON files for 59,311 papers. Thus, not all the papers in the raw data are present in the metadata file. For our analysis, we will use the papers that have JSON files, as this should provide more contextual understanding and answers for researchers.
**Note**: 51,078 is the number of rows in the resulting output, while 18 is the number of columns. Each row represents a paper and each column represents a metadata field associated with that paper.

In [None]:
metadata_information.shape

#### Extracting Relevant Papers based on Date

In order to save memory, we include the option here to filter for papers published after a specified date. The function below, ```is_date()```, takes a string as input and uses the *dateutil* package to see if the string can be interpreted as a date. The function returns "True" if it can; "False" otherwise.

The code block following the ```is_date()``` function then takes a date as a string input (our default value is "2019-12-01"), and uses the metadata to identify the IDs of the papers that were published on or after this date. To set a specific date, simply modify the "filter_date" variable appropriately, or enter a non-valid date string to disable this filtering altogether. The result of this code block, "paper_id_list", is the list of IDs for papers published on or after the specified date (or all papers if the filtering is disabled).

In [None]:
# Checks if string input can be interpreted as a date
#    
# Inputs: (string) - string to check whether it is a valid date
# Outputs: (bool) - True if string is a valid date; False otherwise
#
def is_date(string, fuzzy=False):
    
    '''
    Checks if string input can be interpreted as a date
    
    Inputs: (string) - to check if date interpretation is possible
    Outputs: (bool) - True if possible else False
    '''
    from dateutil.parser import parse
    
    try: 
        parse(string, fuzzy=fuzzy)
        return True

    except ValueError:
        return False

From the output below, we see that the metadata file has information on 51,078 papers, but the raw data in the subdirectories includes JSON files for only 3,890 papers. Thus, not all the papers in the metadata file are present in the raw data. For our analysis, we will use the papers that have JSON files, as this should provide more contextual understanding and answers for researchers.

We iterate through all seven available datasets and extract the JSON information into lists, in preparation for conversion into a DataFrame.

In [None]:
#print('Please input in the earliest date to filter the research paper (yyyy-mm-dd)!')
#filter_date = str(input())

# Modify this date per user requirements, or enter a non-valid date string to disable publication date filtering
filter_date = '2019-12-01'

# paper_id_list is a list of the IDs for all papers published after the specified date
# (or all papers if the date filtering is disabled).

if is_date(filter_date):
    paper_id_list = metadata_information[metadata_information['publish_time'] >= filter_date].dropna(subset=['sha'])['sha'].tolist()
    
else:
    # Drop missing IDs here as well, so that both branches return only valid paper IDs
    paper_id_list = metadata_information['sha'].dropna().tolist()
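
Because the filter above compares "publish_time" strings with the `>=` operator, the comparison is lexicographic rather than calendar-based. For full ISO dates (yyyy-mm-dd) the two orders agree, but a value in another format may be handled differently. The sketch below contrasts the two behaviors on made-up sample values; whether the real metadata contains year-only or missing values is an assumption, and `published_on_or_after` is a hypothetical helper, not part of the notebook.

```python
from datetime import date

def published_on_or_after(publish_time, cutoff="2019-12-01"):
    """Compare by parsed calendar date; values that are not full ISO dates return False."""
    try:
        return date.fromisoformat(publish_time) >= date.fromisoformat(cutoff)
    except (TypeError, ValueError):
        return False

# Made-up sample values: two full dates, a year-only string, and a missing value
samples = ["2020-03-15", "2019-11-30", "2020", None]

# Plain string comparison, as in the filter above
string_compare = [isinstance(s, str) and s >= "2019-12-01" for s in samples]

# Comparison after parsing each value as a date
parsed_compare = [published_on_or_after(s) for s in samples]

print(string_compare)
print(parsed_compare)
```

The year-only value passes the string comparison but is rejected by the parsed comparison, so the choice between the two determines whether such papers are kept.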

#### 4.2.2.3 Extracting Key Data from each Paper

Now, we want to extract relevant data from each of the papers. The next code block accomplishes this as outlined in the descriptive subsections below.

##### Extract Title, Abstract, Text, and Paper ID Information for Each Paper

Next, we extract the title, abstract, full text, and paper ID for each paper, across all 7 sub-directories. We take all the paper titles, abstracts, etc., and put them in separate lists. The result of this code is that we have separate lists (most of which are lists of strings) which contain:

- the title of each paper 
- the abstract of each paper
- the full text of each paper
- the ID of each paper

In [None]:
def create_library(list_of_folders, list_of_papers = paper_id_list):
    
    '''
    Loads the JSON files for the selected papers from each sub-directory
    
    Inputs: (list) - list_of_folders ; the names of the sub-directories in the library
            (list) - list_of_papers ; the IDs of the papers to load
    Outputs: (list) - one dictionary per paper, as parsed from its JSON file
    '''
    import json

    internal_library = []

    # Use a set so that each membership test does not scan the whole ID list
    wanted_ids = set(list_of_papers)

    for folder in log_progress(list_of_folders, every = 1):

        # Each sub-directory can hold a pdf_json folder and a pmc_json folder
        for json_type in ['pdf_json', 'pmc_json']:

            json_dir = os.path.join(data_dir, folder, folder, json_type)

            # A missing folder only skips that folder, not the rest of the sub-directory
            if not os.path.isdir(json_dir):
                continue

            file_list = [f for f in os.listdir(json_dir) if f.split('.')[0] in wanted_ids]
            group_name = folder + '_' + json_type.split('_')[0]
            print('There are {a} papers in the {c} group after {b}.'.format(a = len(file_list), b = filter_date, c = group_name))

            for each_file in file_list:

                try:
                    with open(os.path.join(json_dir, each_file)) as f:
                        internal_library.append(json.load(f))

                except (OSError, json.JSONDecodeError):
                    # Skip files that cannot be read or parsed
                    continue

    return internal_library

def data_creation(list_of_folders, metadata, date = filter_date, list_of_papers = paper_id_list):
    
    '''
    Converts JSON files into CSV based on various criteria
    
    Inputs: (list) - List_of_folders ; the names of the sub-directories in the library
            (DataFrame) - metadata ; metadata information provided in the library
            (string) - date ; filter criteria on publishing date information available in metadata
            (list) - list_of_papers ; containing the index information of papers published after date
    Outputs: (DataFrame) - dataframe containing on relevant papers
    '''    
    internal_library = create_library(list_of_folders = list_of_folders, list_of_papers = list_of_papers)

    title_list = []          # list of paper titles
    abstract_list = []       # list of paper abstracts
    text_list = []           # list of paper full texts

    # Extracting title, abstract and text information for each paper
    # internal_library is a list of dictionaries, where each dictionary corresponds to one paper
    for i in range(len(internal_library)):

        title_list.append(internal_library[i].get('metadata').get('title'))

        try:
            abstract_list.append(co.abstract(internal_library[i]))
        except:
            abstract_list.append('No Abstract')

        text_list.append(co.text(internal_library[i]))

    #Extracting Paper ID Information
    paper_id = [i.get('paper_id') for i in internal_library]   # list of the ID for each paper

    #Extracting the location and country that published the research paper
    primary_location_list = []      # list of the primary locations for the authors of each paper
    primary_country_list = []       # list of the primary countries for the authors of each paper

    # Extracting the primary location, and country for the authors of each paper
    # internal_library is a list of dictionaries, where each dictionary corresponds to one paper

    # Extract list of metadata dictionaries for each paper
    internal_metadata = [i['metadata'] for i in internal_library]

    # individual_paper_metadata is the 'metadata' dictionary for one paper
    for individual_paper_metadata in internal_metadata:

        # Extract the list of author dictionaries for the current paper
        authors_information = individual_paper_metadata.get('authors')

        if len(authors_information) == 0:
            primary_location_list.append('None')
            primary_country_list.append('None')

        else:
            location = None
            country = None

            # Find the first author of the paper with valid data for "institution",
            # location, and country, extract this information, and add to
            # the respective lists for all the papers
            for author in authors_information:

                affiliation = author.get('affiliation')

                if affiliation:

                    # 'location' may be missing, so fall back to an empty dictionary
                    author_location = affiliation.get('location') or {}
                    location = author_location.get('settlement')
                    country = author_location.get('country')

                # Stop at the first author for whom a location was found
                if location is not None:
                    break

            primary_location_list.append(location)
            primary_country_list.append(country)
                
    #Loading all the extracted information into a DataFrame for merger
    index_df = pd.DataFrame(paper_id, columns =  ['paper_id'])

    geographical_df = pd.DataFrame(primary_location_list, columns = ['Location'])
    geographical_df['Country'] = primary_country_list

    paper_info_df = pd.DataFrame(title_list, columns = ['Title'])
    paper_info_df['Abstract'] = abstract_list
    paper_info_df['Text'] = text_list
    
    #This dataframe contains all the information extracted from the JSON files and converted into CSV.
    combined_df = pd.concat([index_df, geographical_df, paper_info_df], axis = 1)
    
    #Creating the merger between the metadata (45000+) and the research papers (33000+)
    part_1 = metadata[['sha', 'abstract', 'url', 'publish_time']]

    test_df = combined_df.merge(part_1, left_on = ['paper_id'], right_on = ['sha'], how = 'left')
    test_df.drop(['sha'], axis = 1,inplace = True)
    test_df = test_df[['paper_id', 'url', 'publish_time', 'Location', 'Country', 'Title', 'Abstract', 'abstract', 'Text']]
    
    #In the event where the JSON's abstract is null but there is an abstract in the metadata, it will be used as a substitute.
    test_df['Abstract'] = np.where(test_df['Abstract'] == '', test_df['abstract'], test_df['Abstract'])
    test_df.drop(['abstract'], axis = 1, inplace = True)
    
    gc.collect()
    
    return test_df
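
The author lookup inside ```data_creation()``` can be illustrated in isolation. The sketch below uses a hypothetical helper, `first_affiliation`, and made-up author records shaped like the "authors" entries in the paper JSON; it returns the settlement and country of the first author whose affiliation includes a settlement.

```python
def first_affiliation(authors):
    """Return (settlement, country) for the first author with a known settlement."""
    for author in authors:
        # Both 'affiliation' and 'location' may be missing or empty dictionaries
        location = (author.get("affiliation") or {}).get("location") or {}
        if location.get("settlement") is not None:
            return location.get("settlement"), location.get("country")
    return None, None

# Made-up author records for illustration
authors = [
    {"first": "A", "last": "One", "affiliation": {}},
    {
        "first": "B",
        "last": "Two",
        "affiliation": {
            "institution": "Example Institute",
            "location": {"settlement": "Stockholm", "country": "Sweden"},
        },
    },
]
print(first_affiliation(authors))
```

Falling back to an empty dictionary at each level keeps the lookup from failing when a paper lists authors without affiliation details.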

##### Creating a List of Datasets that Correspond to Raw Data Subdirectories

Each article in the dataset is located in one of four subdirectories under the "Raw Data" directory, depending on how the article is licensed:
- "comm_use_subset"
- "noncomm_use_subset"
- "custom_license"
- "biorxiv_medrxiv"

The following code block defines the list of subdirectory names and passes it to ```data_creation()```. For each subdirectory, the JSON file for every selected paper is read into a Python dictionary, so the intermediate result is a list of dictionaries in which each dictionary describes one paper. ```data_creation()``` then combines the extracted fields with the metadata and returns a single DataFrame, "test_df".

In [None]:
# Create a list of the datasets corresponding to each subdirectory over which we can iterate

# Define as a list the names of all the subdirectories in the 'Raw Data'
# directory where the dataset files are stored
selected_folders = ['comm_use_subset', 'noncomm_use_subset', 'custom_license', 'biorxiv_medrxiv']
test_df = data_creation(list_of_folders = selected_folders, metadata = metadata_information)

In [None]:
test_df.columns

### 4.2.3 Checkpoint 1

At this point, we want to save our combined DataFrame as a .csv file. We also want to delete the variables we no longer need to reclaim memory space before continuing code execution.

In [None]:
test_df.to_csv(output_dir + 'Checkpoint_1.csv', index = False)

In [None]:
#Cleaning up after each section to save space
del paper_id_list
del metadata_information
del selected_folders

import gc
gc.collect()

## 4.3 Cleaning the Dataset

Now that we have extracted the relevant raw data for all the papers and reformatted it into a single *pandas* DataFrame, we need to do some additional cleaning of the dataset. This includes:

- removing unnecessary or unhelpful characters and words
- removing duplicate papers
- making the country names uniform English names

### 4.3.1 Cleaning the Paper Text Sections

The text extracted from the paper JSON files is quite dirty. The code below cleans the various text sections that we extracted from the raw data for each paper. Specifically, it:

- fills holes (i.e., null values) in the DataFrame with the string, "No Information"
- removes unnecessary garbage characters and white space
- removes references and annotations (i.e., [1], (1), etc.)
- removes "figure X.X" references 

Note that we are using the ```log_progress()``` helper function defined above to provide an indication of the progress of the execution.

Additionally for the abstract text of each paper, it removes unnecessary starting words, such as "background" or "abstract", and also counts the number of words in each abstract and adds a column to the "test_df" DataFrame which contains the number of words in the abstract for each paper.

In [None]:
def cleaning_dataset(dataset, columns_to_clean):
    
    # each_column is one of the defined text section columns from the DataFrame
    # Use the log_progress() helper function defined above to indicate the progress of the execution
    for each_column in log_progress(columns_to_clean, every = 1):

        # Fill in any null text items with "No Information"
        dataset[each_column] = dataset[each_column].fillna('No Information')

        # Remove square-bracketed references (i.e., [1])
        dataset[each_column] = dataset[each_column].apply(lambda x: re.sub(r'\[.*?]', r'', x))

        # Remove parenthesis references (i.e., (1))
        dataset[each_column] = dataset[each_column].apply(lambda x: re.sub(r'\((.*?)\)', r'', x))

        # Remove garbage characters (anything other than letters, digits, '.', '%', '-', and white space)
        dataset[each_column] = dataset[each_column].apply(lambda x: re.sub(r'[^a-zA-Z0-9.%\s-]', r'', x))

        # Remove unnecessary white space
        dataset[each_column] = dataset[each_column].apply(lambda x: re.sub(r' +', r' ', x))

        # Remove unnecessary white space at the end of the text section
        dataset[each_column] = dataset[each_column].apply(lambda x: x.rstrip())

        # Remove white space before punctuation marks
        dataset[each_column] = dataset[each_column].apply(lambda x: re.sub(r'\s([?.!"](?:\s|$))', r'\1', x))

    cleaned_abstract = []     # list of cleaned abstracts for all the papers
    abstract_count = []       # list of the word counts for each paper abstract

    # Clean up abstracts as abstracts may contain unnecessary starting words like 'background' or 'abstract'
    # Count the words in each cleaned abstract and add the list of abstract word counts for each paper to
    # the test_df Data Frame
    #
    # i is the abstract text (string) for one paper
    for i in dataset['Abstract']:

        if i.split(' ')[0].lower() == 'background' or i.split(' ')[0].lower() == 'abstract':
            cleaned_abstract.append(' '.join(i.split(' ')[1:]))
            abstract_count.append(len(i.split(' ')[1:]))

        else:
            cleaned_abstract.append(i)
            abstract_count.append(len(i.split()))

    dataset['Abstract'] = cleaned_abstract
    dataset['Abstract Word Count'] = abstract_count

    # Removing the words figure X.X from the passages because it contributes no meaning
    fig_exp = re.compile(r"Fig(?:ure|.|-)\s+(?:\d*[a-zA-Z]*|[a-zA-Z]*\d*|\d*)", flags=re.IGNORECASE) 
    dataset['Text'] = [(re.sub(fig_exp, '', i)) for i in dataset['Text']]

    # Remove other instances of poor references and annotations
    poor_annotation_exp_1 = re.compile(r'(\d)\s+(\d]*)', flags = re.IGNORECASE)
    dataset['Text'] = [(re.sub(poor_annotation_exp_1, '', i)) for i in dataset['Text']]

    poor_annotation_exp_2 = re.compile(r'(\d])*', flags = re.IGNORECASE)
    dataset['Text'] = [(re.sub(poor_annotation_exp_2, '', i)) for i in dataset['Text']]
    
    gc.collect()
    
    return dataset

In [None]:
## Cleaning up Dataset in the selected text columns
text_columns = ['Title', 'Abstract', 'Text']
test_df = cleaning_dataset(dataset = test_df, columns_to_clean = text_columns)
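
To see what the regular-expression steps in `cleaning_dataset()` do to a single string, the following minimal sketch applies the same substitutions to a made-up sentence. The helper names `clean_text` and `strip_leading_label` and the sample text are illustrative assumptions, not part of the notebook; the allowed-character pattern is written here with the `A-Z` range.

```python
import re

def clean_text(text):
    """Hypothetical single-string version of the per-column cleaning steps."""
    text = re.sub(r'\[.*?]', '', text)                 # square-bracketed references, e.g. [1]
    text = re.sub(r'\((.*?)\)', '', text)              # parenthesised references, e.g. (1)
    text = re.sub(r'[^a-zA-Z0-9.%\s-]', '', text)      # characters outside the allowed set
    text = re.sub(r' +', ' ', text)                    # runs of spaces
    text = text.rstrip()                               # trailing white space
    text = re.sub(r'\s([?.!"](?:\s|$))', r'\1', text)  # white space before punctuation
    return text

def strip_leading_label(abstract):
    """Drop a leading 'background' or 'abstract' word; return (text, word count)."""
    words = abstract.split(' ')
    if words[0].lower() in ('background', 'abstract'):
        words = words[1:]
    return ' '.join(words), len(words)

# Made-up sample sentence, for illustration only
sample = 'Background: SARS-CoV [1] spreads (see Fig 2) quickly .'
cleaned = clean_text(sample)
abstract, n_words = strip_leading_label(cleaned)
print(cleaned)
print(abstract, n_words)
```

Note that the colon after "Background" is removed by the allowed-character step, which is why the leading-word check afterwards can compare against the bare word.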

### 4.3.2 Removing Duplicate Papers

If we examine some characteristics of the "Abstract" column of the "test_df" DataFrame, we notice that there appear to be a significant number (~700+) of duplicate abstracts in the data set. One possible reason for this is that a given paper may have been published in more than one journal. We do not want such duplicates. The following code looks for duplicated papers, identified by rows whose abstract text and full text are both identical to those of an earlier row, and removes the corresponding rows from the DataFrame. In the output below we see that dropping the apparent duplicates has reduced the number of non-unique text entries in the DataFrame to fewer than a hundred.

In [None]:
test_df['Abstract'].describe(include='all')

In [None]:
test_df.drop_duplicates(['Abstract', 'Text'], inplace = True)

In [None]:
test_df['Text'].describe(include = 'all')
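
As a rough illustration of the rule applied by `drop_duplicates(['Abstract', 'Text'])`, the pure-Python sketch below keeps the first row for each distinct (abstract, text) pair. The function name `drop_duplicate_pairs` and the toy rows are assumptions made for this example only; it mirrors the documented *pandas* behaviour that a row counts as a duplicate only when all listed columns match an earlier row.

```python
def drop_duplicate_pairs(rows, keys=('Abstract', 'Text')):
    """Keep the first row seen for each distinct combination of the key columns."""
    seen = set()
    kept = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    return kept

# Toy rows, for illustration only
rows = [
    {'paper_id': 'a', 'Abstract': 'same abstract', 'Text': 'same text'},
    {'paper_id': 'b', 'Abstract': 'same abstract', 'Text': 'same text'},       # exact repeat of 'a'
    {'paper_id': 'c', 'Abstract': 'same abstract', 'Text': 'different text'},  # abstract repeats, text does not
]
kept = drop_duplicate_pairs(rows)
print([row['paper_id'] for row in kept])
```

Row "c" survives because only its abstract repeats, which would explain why some non-unique abstracts can remain after this step.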

#### 4.3.2.1 Checkpoint 2

In [None]:
test_df.to_csv(output_dir + 'Checkpoint_2.csv', index = False)

In [None]:
test_df.columns

### 4.3.3 Identifying and Removing Papers that are Outliers

To further clean the dataset, we want to remove any papers that would be classified as outliers. 
An outlier would be a paper that:

- has no text
- has text that does not appear to be relevant to the corpus as a whole, based on TF-IDF scoring
- has fewer than 150 words of text

The following code identifies and removes outlier papers based on these criteria.

#### 4.3.3.1 Identifying and Removing Papers with No Body Text

The first step is straightforward - we simply use the *pandas* ```.dropna()``` method to drop any papers that have a null value in the "Text" column (paper full text) of our DataFrame. While there are no research papers without any text in the current library, such papers may exist in a larger library. We drop them because they yield limited information.

In [None]:
test_df.dropna(subset = ['Text'], inplace = True)

#### 4.3.3.2 Removing Papers that Do Not Appear to be Relevant to the Corpus as a Whole

To identify papers that do not appear to be relevant to the corpus as a whole - i.e., research on COVID-19 and related coronaviruses - we perform TF-IDF analysis, using trigrams, across all the paper text. Once we have the corpus vocabulary (features) and TF-IDF scores, we use t-SNE dimensionality reduction to compare the TF-IDF paper scores and highlight papers with scores that significantly differ from those of the main body of papers. A significant difference in TF-IDF score indicates that the content of the paper is not closely related to that of the main body of papers.

The code below uses the *scikit-learn* ```TfidfVectorizer``` class to extract trigrams (features) and compute TF-IDF scores for the text of all the papers in the dataset. Since the number of features across 50,000+ papers becomes quite large (63+ million), the code then uses t-SNE dimensionality reduction to project the paper TF-IDF scores into a three-dimensional space that can be visualized and in which outliers can be identified per standard criteria.
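
To make the trigram TF-IDF idea concrete, here is a minimal pure-Python sketch on three toy sentences. It uses the smoothed IDF that *scikit-learn* documents as the ```TfidfVectorizer``` default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation. Tokenisation is simplified to lower-casing and whitespace splitting, which is an assumption (the real vectorizer applies its own token pattern), and the function names and toy documents are illustrative only.

```python
import math
from collections import Counter

def word_trigrams(text):
    """Return the word trigrams of a text after naive whitespace tokenisation."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def tfidf_trigrams(docs):
    """Return one {trigram: weight} dict per document (smoothed IDF, L2-normalised)."""
    counts = [Counter(word_trigrams(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())          # number of documents containing each trigram
    vectors = []
    for c in counts:
        weights = {g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
                   for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({g: w / norm for g, w in weights.items()} if norm else {})
    return vectors

# Toy documents, for illustration only
toy_docs = [
    'coronavirus spike protein binds the receptor',
    'the coronavirus spike protein is a vaccine target',
    'stock markets fell sharply on monday morning',
]
vectors = tfidf_trigrams(toy_docs)
for doc, vec in zip(toy_docs, vectors):
    print(doc, '->', {g: round(w, 3) for g, w in vec.items()})
```

In this toy corpus the trigram shared by the first two documents receives a lower weight than the trigrams unique to one document, and the third document shares no trigrams with the others, which is the kind of dissimilarity the outlier step is meant to surface.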

In [None]:
def dimension_reduction(dataset, n = 3, n_components = 3, use_hashing_vectorizer = False):

    dataset = dataset.reset_index().drop(['index'], axis = 1)
    
    # Extract n-gram vectors (trigrams by default) for all documents
    if not use_hashing_vectorizer:
    
        vectorizer=TfidfVectorizer(ngram_range=(n,n))
        vectorized_vectors=vectorizer.fit_transform(dataset['Text'].tolist())
        
    else:
        
        vectorizer=HashingVectorizer(ngram_range=(n,n))
        vectorized_vectors=vectorizer.fit_transform(dataset['Text'].tolist())

    #Dimensionality Reduction
    tsne_reduction = TSNE(n_components = n_components, perplexity = 10, learning_rate = 100, random_state = 777)
    tsne_data = tsne_reduction.fit_transform(vectorized_vectors)

    #Converting components of T-SNE into dataframe
    tsne_df = pd.DataFrame(tsne_data, columns = [i for i in range(0, tsne_data.shape[1])])
    gc.collect()
    return tsne_df

def visualizing_dimensions(dataset):

    fig = plt.figure(1, figsize=(7, 5))
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

    ax.scatter(dataset[0], dataset[1], dataset[2], c=dataset[2], cmap='viridis', linewidth=0.5)

    ax.set_xlabel('Component A')
    ax.set_ylabel('Component B')
    ax.set_zlabel('Component C')

    plt.show()
    gc.collect()
    
def outlier_removals(dim_reduced_dataset, dataset, n_components = 3, number_std_dev = 2.5, verbose = 1):
    
    outlier_papers = []
    print('{a} standard deviation is being used to clean the dataset.'.format(a = number_std_dev))
    print()
    for i in range(0, n_components):
        
        upper = dim_reduced_dataset[i].mean() + number_std_dev*dim_reduced_dataset[i].std()
        lower = dim_reduced_dataset[i].mean() - number_std_dev*dim_reduced_dataset[i].std()

        outlier_df = dim_reduced_dataset[(dim_reduced_dataset[i] >= upper) | (dim_reduced_dataset[i] <= lower)]
        outlier_list = outlier_df.reset_index()['index'].tolist()
        
        outlier_papers += outlier_list
        
    outlier_papers = list(set(outlier_papers))
    
    if verbose == 1:
        print('There are {a} outlier papers identified.'.format(a = len(outlier_papers)))
        print()
        
    outlier_papers_df = dataset.iloc[outlier_papers,:]
    
    if verbose == 1:
        print('These are the texts that are determined as abnormal.')
        print()
        for i in outlier_papers_df['Text']:
            print(i)
            print()
    
    #remove outliers (outlier_papers holds row positions, so translate them to index labels before dropping)
    cleaned_df = dataset.drop(dataset.index[outlier_papers], axis = 0)
    cleaned_df = cleaned_df.reset_index(drop = True)
    gc.collect()
    return cleaned_df

def full_cleaning_process(dataset, n = 3, n_components = 3, use_hashing_vectorizer = False, std_dev = 3, verbose = 1):
    
    starting_datashape = dataset.shape[0]
    dim_reduced_dataset = dimension_reduction(dataset, n = n, n_components = n_components, use_hashing_vectorizer = use_hashing_vectorizer)
    print('Before Cleaning Up -')
    visualizing_dimensions(dim_reduced_dataset)
    output_df = outlier_removals(dim_reduced_dataset, dataset, n_components = n_components, number_std_dev = std_dev, verbose = verbose)
    ending_datashape = output_df.shape[0]
    print('{a} rows were dropped in this cleaning process.'.format(a = starting_datashape - ending_datashape))
    print()
    print('After Cleaning Up -')
    visualizing_dimensions(dimension_reduction(output_df, n = n, n_components = n_components, use_hashing_vectorizer = use_hashing_vectorizer))
    gc.collect()
    return output_df

In [None]:
test_df = full_cleaning_process(test_df, std_dev = 2.5)

The three-dimensional graph above shows a representation of the TF-IDF paper scores after t-SNE dimensionality reduction. We see that most of the papers are clustered together, with a handful of papers lying significantly outside the main cluster. These papers are outliers that we want to remove from the dataset.

To identify the outlier papers more quantitatively, we use a standard-deviation threshold. Specifically, we select those papers whose coordinates lie 2.5 or more standard deviations (a value chosen through experimentation) away from the mean on any of the three axes. For normally distributed values, such a band would cover roughly 98.8% of the points, so it is wider than the ±1.96 standard deviations of a 95% interval.

From the printout, we can see the number of outlier papers and their contents.
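
The thresholding rule described above can be sketched with the standard library alone. This is a simplified stand-in for `outlier_removals()`, assuming the reduced data is a list of coordinate tuples; the function name `flag_outliers` and the toy points are hypothetical.

```python
import statistics

def flag_outliers(points, number_std_dev=2.5):
    """Return sorted positions of points at or beyond mean +/- k*std on any axis."""
    flagged = set()
    n_axes = len(points[0])
    for axis in range(n_axes):
        values = [p[axis] for p in points]
        mean = statistics.mean(values)
        std = statistics.stdev(values)      # sample standard deviation, as pandas .std() computes
        upper = mean + number_std_dev * std
        lower = mean - number_std_dev * std
        flagged.update(i for i, v in enumerate(values) if v >= upper or v <= lower)
    return sorted(flagged)

# Toy data: 20 points in a tight cluster plus one far-away point at position 20
points = [(i % 5 - 2, i % 3 - 1, i % 4 - 1.5) for i in range(20)] + [(60.0, 0.0, 0.0)]
print(flag_outliers(points))

# Share of a normal distribution lying within +/- 2.5 standard deviations
coverage = 2 * statistics.NormalDist().cdf(2.5) - 1
print(round(coverage, 4))
```

Because the mean and standard deviation are computed over all points, a very distant point inflates the standard deviation itself, so the choice of multiplier is worth revisiting whenever the corpus changes.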

#### 4.3.3.3 Removing Papers with Less Than 150 Words 

The last step in eliminating outliers from the dataset is to identify papers containing fewer than 150 words. The minimum of 150 words was derived through numerous rounds of experimentation. It also supports the minimum input requirements to generate a smart summary for a paper as part of our results.

The code below creates a word count for each paper and adds the list of word counts as a new column in the DataFrame. It then identifies those papers whose text is at or below the 150-word threshold.

In [None]:
minimum_word_count = 150

test_df = test_df.reset_index().drop(['index'], axis = 1)
test_df['Text Word Count'] = [len(i.split()) for i in test_df['Text']]

dirty_list = []

for index, value in test_df.iterrows():
    
    if (value['Text Word Count'] <= minimum_word_count):
        dirty_list.append(index)
        
weird_papers_df = test_df.iloc[dirty_list,:]

for index, value in weird_papers_df.iterrows():
    print(value['Text Word Count'], value['Text'])
    print()

Taking a look at the text of the papers below the word-count threshold, we can see two things. Firstly, these papers contain little to no information that is useful to provide context or insights to medical researchers. Secondly, some of these papers have non-English content. We remove these papers from the "test_df" DataFrame and reset its index.

In [None]:
test_df = test_df.drop(dirty_list, axis = 0)
test_df = test_df.reset_index().drop(['index'], axis = 1)

#### 4.3.3.4 Checkpoint 3

In [None]:
test_df.to_csv(output_dir + 'Checkpoint_3.csv', index = False)

In [None]:
#Cleaning up after each section to save space
gc.collect()

### 4.3.4 Identifying and Removing Papers whose Primary Language is not English

It is possible that the dataset at this point contains papers that are not in English. As a final cleaning step, we want to remove from the dataset papers whose primary language is not English. This ensures that the subsequent machine learning techniques we employ to help answer the questions for Task 1 are performed on a dataset that is as clean as possible.

In [None]:
#A scientific NLP model is loaded to find articles that may not be in English. This acts as a final data
#clean-up, ensuring that the subsequent ML techniques are performed on as clean a dataset as possible.
nlp = en_core_sci_lg.load()
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)

In [None]:
#This library of research papers may contain non-English papers.

#Some texts exceed the maximum length of 1,000,000 characters and may cause memory allocation errors. To keep computational
#resources under control, we cut such texts off at the last full stop before that limit instead of raising the limit.

language_list = []
for i in log_progress(test_df['Text'], every = 1):
    
    if len(i) <= 1000000:
    
        doc = nlp(i)
        language_list.append(doc._.language)
        
    else:
        
        cut_off_index = i[:1000000].rfind('.')
        focus_i = i[:cut_off_index + 1]
        
        doc = nlp(focus_i)
        language_list.append(doc._.language)
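
The cut-off logic in the loop above can be isolated into a small helper, sketched below. The function name `truncate_at_sentence` is hypothetical, and the fallback for texts with no full stop inside the window is an added assumption: in the loop above, `rfind` would return -1 in that case and the text passed to the model would be empty.

```python
def truncate_at_sentence(text, max_chars=1000000):
    """Return text unchanged if short enough, else cut it at the last '.' within max_chars."""
    if len(text) <= max_chars:
        return text
    cut_off_index = text[:max_chars].rfind('.')
    if cut_off_index == -1:
        # Assumption: with no full stop inside the window, fall back to a hard cut
        return text[:max_chars]
    return text[:cut_off_index + 1]

print(truncate_at_sentence('One. Two. Three', max_chars=12))
print(truncate_at_sentence('abcdefgh', max_chars=4))
```

Cutting at a sentence boundary keeps the final sentence intact for the language detector, at the cost of ignoring everything after the limit.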

In [None]:
#Store the language detected for each paper. The detection result also carries a score for the detected language,
#which can help when dealing with papers that combine several languages; only the language code is kept here.
filtered_language_list = [i['language'].upper() for i in language_list]
test_df['Language'] = filtered_language_list

#Filtering out only research papers in English to perform topic modelling.
english_df = test_df[test_df['Language'] == 'EN'].copy()
print('There are {a} research papers in English out of {b} research papers.'.format(a = english_df.shape[0], b = test_df.shape[0]))

#### 4.3.4.1 Checkpoint 4

Before continuing, we again save our current version of the "english_df" DataFrame as "/kaggle/working/Checkpoint_4.csv", and perform a memory cleanup.

In [None]:
# drop off Language column, as all articles are English
english_df.drop(columns='Language', inplace=True)

In [None]:
english_df.shape

In [None]:
english_df.to_csv(output_dir + 'Checkpoint_4.csv', index = False)

In [None]:
#Cleaning up after each section to save space
gc.collect()

## Creating the Named-Entity-Based Phrase Matching

In [None]:
# Load libraries 
import os 
import numpy as np 
import pandas as pd 
import glob
import gc

from tqdm.notebook import tqdm

# Load word cloud function
from wordcloud import WordCloud, STOPWORDS 
import matplotlib.pyplot as plt 

import spacy
from spacy.matcher import PhraseMatcher #import PhraseMatcher class
nlp = spacy.load('en_core_web_lg') # Language class with the English model 'en_core_web_lg' is loaded

### <font color=blue><b>Support Functions</b></font>
***

In [None]:
def wordcloud_draw(text, color = 'white'):
    """
    Plots wordcloud of string text after removing stopwords
    """
    cleaned_word = " ".join([word for word in text.split()])
    wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color=color,
                      width=1000,
                      height=1000
                     ).generate(cleaned_word)
    plt.figure(1,figsize=(15, 15))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()

### <font color=blue><b>Load clean dataset</b></font>
***

In [None]:
# Load checkpoint #4

df=pd.read_csv(output_dir + 'Checkpoint_4.csv')
df.shape

##### Load predefined reference lists

In [None]:
# clean up
del test_df
del english_df

gc.collect()

In [None]:
# Set list data directory
lists_data_dir = '/kaggle/input/task10lists/master'
os.chdir(lists_data_dir)
os.getcwd()

In [None]:
# Load list of therapeutics:
df_therapeutics = pd.read_csv('therapeutics.csv')
#df_therapeutics.shape
#df_therapeutics.head()
therapeutics_list = df_therapeutics.iloc[:, 0].tolist()
print(therapeutics_list)

In [None]:
# Load list of vaccines:
df_vaccine = pd.read_csv('vaccines.csv')
#df_vaccine.shape
#df_vaccine.head()
vaccine_list = df_vaccine.iloc[:, 0].tolist()
print(vaccine_list)

In [None]:
# Load list of animals:
df_animals = pd.read_csv('animals.csv')
#df_animals.shape
#df_animals.head()
animals_list = df_animals.iloc[:, 0].tolist()
print(animals_list)

In [None]:
# Load list of covid19:
df_covid19 = pd.read_csv('covid-19.csv')
#df_covid19.shape
#df_covid19.head()
covid19_list = df_covid19.iloc[:, 0].tolist()
print(covid19_list)

In [None]:
# Load list of drugs:
df_drugs = pd.read_csv('drugs.csv')
#df_drugs.shape
#df_drugs.head()
drugs_list = df_drugs.iloc[:, 0].tolist()
print(drugs_list)

In [None]:
# effectivenes
df_effectivenes = pd.read_csv('effectivenes.csv')
#df_effectivenes.shape
#df_effectivenes.head()
effectivenes_list = df_effectivenes.iloc[:, 0].tolist()
print(effectivenes_list)

In [None]:
# symptom
df_symptom = pd.read_csv('symptom.csv')
#df_symptom.shape
#df_symptom.head()
symptom_list = df_symptom.iloc[:, 0].tolist()
print(symptom_list)

In [None]:
# human
df_human = pd.read_csv('human.csv')
#df_human.shape
#df_human.head()
human_list = df_human.iloc[:, 0].tolist()
print(human_list)

In [None]:
# model
df_model = pd.read_csv('model.csv')
#df_model.shape
#df_model.head()
model_list = df_model.iloc[:, 0].tolist()
print(model_list)

In [None]:
# recipient
df_recipient = pd.read_csv('recipient.csv')
#df_recipient.shape
#df_recipient.head()
recipient_list = df_recipient.iloc[:, 0].tolist()
print(recipient_list)

In [None]:
# antiviral 
df_antiviral_agent  = pd.read_csv('antiviral.csv')
#df_antiviral_agent.shape
#df_antiviral_agent.head()
antiviral_agent_list = df_antiviral_agent.iloc[:, 0].tolist()
print(antiviral_agent_list)

In [None]:
# challenge
df_challenge = pd.read_csv('challenge.csv')
#df_challenge.shape
#df_challenge.head()
challenge_list = df_challenge.iloc[:, 0].tolist()
print(challenge_list)

In [None]:
# universal
df_universal = pd.read_csv('universal.csv')
#df_universal.shape
#df_universal.head()
universal_list = df_universal.iloc[:, 0].tolist()
print(universal_list)

In [None]:
# prioritize
df_prioritize = pd.read_csv('prioritize.csv')
#df_prioritize.shape
#df_prioritize.head()
prioritize_list = df_prioritize.iloc[:, 0].tolist()
print(prioritize_list)

In [None]:
# scarce
df_scarce = pd.read_csv('scarce.csv')
#df_scarce.shape
#df_scarce.head()
scarce_list = df_scarce.iloc[:, 0].tolist()
print(scarce_list)

In [None]:
# healthcare
df_healthcare = pd.read_csv('healthcare.csv')
#df_healthcare.shape
#df_healthcare.head()
healthcare_list = df_healthcare.iloc[:, 0].tolist()
print(healthcare_list)

In [None]:
# ppe
df_ppe = pd.read_csv('ppe.csv')
#df_ppe.shape
#df_ppe.head()
ppe_list = df_ppe.iloc[:, 0].tolist()
print(ppe_list)

In [None]:
# risk
df_risk = pd.read_csv('risk.csv')
#df_risk.shape
#df_risk.head()
risk_list = df_risk.iloc[:, 0].tolist()
print(risk_list)

In [None]:
# ADE
df_ADE = pd.read_csv('antibody.csv')
#df_ADE.shape
#df_ADE.head()
ADE_list = df_ADE.iloc[:, 0].tolist()
print(ADE_list)
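
The cells above repeat the same three steps for every reference list. As a sketch of a more compact alternative, the loader below reads the first column of each named CSV file into a dictionary. It uses the standard-library `csv` module rather than *pandas*, assumes (as `pd.read_csv` does by default) that the first row of each file is a header, and the names `load_first_column` and `load_reference_lists`, as well as the throwaway demo files, are hypothetical.

```python
import csv
import os
import tempfile

def load_first_column(path):
    """Read the first column of a CSV file, skipping the header row."""
    with open(path, newline='', encoding='utf-8') as handle:
        rows = list(csv.reader(handle))
    return [row[0] for row in rows[1:] if row]

def load_reference_lists(directory, names):
    """Map each list name to the first-column values of '<name>.csv' in directory."""
    return {name: load_first_column(os.path.join(directory, name + '.csv'))
            for name in names}

# Demo with two throwaway files containing placeholder terms
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, terms in {'drugs': ['naproxen', 'minocycline'], 'animals': ['mouse']}.items():
        with open(os.path.join(tmp_dir, name + '.csv'), 'w', newline='', encoding='utf-8') as handle:
            writer = csv.writer(handle)
            writer.writerow(['term'])               # header row
            writer.writerows([t] for t in terms)    # one term per row
    reference_lists = load_reference_lists(tmp_dir, ['drugs', 'animals'])

print(reference_lists)
```

With the lists held in one dictionary, the later pattern-registration step could also be written as a single loop over its items.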

In [None]:
# Use LOWER case
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

In [None]:
# Load list into NLP 

patterns = [nlp.make_doc(text) for text in therapeutics_list] 
matcher.add("1", None, *patterns)

patterns = [nlp.make_doc(text) for text in vaccine_list] 
matcher.add("2", None, *patterns)

patterns = [nlp.make_doc(text) for text in animals_list] 
matcher.add("3", None, *patterns)

patterns = [nlp.make_doc(text) for text in covid19_list] 
matcher.add("4", None, *patterns)

patterns = [nlp.make_doc(text) for text in drugs_list] 
matcher.add("5", None, *patterns)

patterns = [nlp.make_doc(text) for text in effectivenes_list]
matcher.add('6', None, *patterns)

patterns = [nlp.make_doc(text) for text in symptom_list]
matcher.add('7', None, *patterns)

patterns = [nlp.make_doc(text) for text in human_list]
matcher.add('8', None, *patterns)

patterns = [nlp.make_doc(text) for text in model_list]
matcher.add('9', None, *patterns)

patterns = [nlp.make_doc(text) for text in recipient_list]
matcher.add('10', None, *patterns)

patterns = [nlp.make_doc(text) for text in antiviral_agent_list]
matcher.add('11', None, *patterns)

patterns = [nlp.make_doc(text) for text in challenge_list]
matcher.add('12', None, *patterns)

patterns = [nlp.make_doc(text) for text in universal_list]
matcher.add('13', None, *patterns)

patterns = [nlp.make_doc(text) for text in prioritize_list]
matcher.add('14', None, *patterns)

patterns = [nlp.make_doc(text) for text in scarce_list]
matcher.add('15', None, *patterns)

patterns = [nlp.make_doc(text) for text in healthcare_list]
matcher.add('16', None, *patterns)

patterns = [nlp.make_doc(text) for text in ppe_list]
matcher.add('17', None, *patterns)

patterns = [nlp.make_doc(text) for text in risk_list]
matcher.add('18', None, *patterns)

patterns = [nlp.make_doc(text) for text in ADE_list]
matcher.add('19', None, *patterns)

In [None]:
df.describe()

In [None]:
import gc
gc.collect()

In [None]:
# add column to data to prepare ranking for given paper
df = df.assign(p_1=0,p_2=0,p_3=0,p_4=0,p_5=0,p_6=0,p_7=0,p_8=0,p_9=0,p_10=0,p_11=0,p_12=0,p_13=0,p_14=0,p_15=0,p_16=0,p_17=0,p_18=0,p_19=0)

In [None]:
df.head()

In [None]:
matching_rows = []
matching_paper_id = []

nlp.max_length = 206000000

pbar = tqdm()
pbar.reset(total=len(df)) 

for i, row in df[:].iterrows(): 
    pbar.update()
    if pd.isnull(row["Text"]):
        continue
    doc = nlp(row["Text"])
    matches = matcher(doc)
    if len(matches) > 0:
        matching_rows.append(i)
        matching_paper_id.append(row["paper_id"])
    for match_id, start, end in matches:
        # Look up the string label of the matched list ("1" to "19")
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]  # the matched phrase (kept for inspection, not used further)
        df.loc[i, "p_" + string_id] = 1  # flag that this paper matched the list
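
To illustrate how phrase matches turn into the binary `p_<label>` flags, the sketch below uses a simplified stand-in for the spaCy `PhraseMatcher`: case-insensitive matching on whitespace tokens. This tokenisation is an assumption and differs from spaCy's tokenizer; the function name `find_labels`, the short phrase lists, and the sample titles are made up for the example.

```python
def find_labels(text, phrase_lists):
    """Return the set of labels whose phrases occur in text (case-insensitive, token-aligned)."""
    tokens = text.lower().split()
    found = set()
    for label, phrases in phrase_lists.items():
        for phrase in phrases:
            p_tokens = phrase.lower().split()
            n = len(p_tokens)
            if n and any(tokens[i:i + n] == p_tokens for i in range(len(tokens) - n + 1)):
                found.add(label)
                break                      # one hit per label is enough for a binary flag
    return found

# Placeholder phrase lists keyed by the same numeric labels used in the notebook
phrase_lists = {'4': ['covid-19', 'sars-cov-2'], '5': ['naproxen'], '6': ['effective']}
papers = [
    'A trial testing whether naproxen is effective in COVID-19 patients',
    'Bat ecology in caves',
]
flags = [
    {'p_' + label: int(label in find_labels(text, phrase_lists)) for label in phrase_lists}
    for text in papers
]
print(flags)
```

Each flag only records whether a list matched at least once; the number of matches within a paper does not affect it.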

In [None]:
df.describe()

## <font color=black>Sub Task 10.1: Effectiveness of drugs being developed and tried to treat COVID-19 patients: Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.</font>
***

In [None]:
# Prepare ranking: drugs + COVID-19 + effectiveness
df = df.assign(rank=df["p_5"] + df["p_4"] + df["p_6"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe(include='all')

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))
print(len(df[df["rank"] == 3]))
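
The ranking used in each sub task is simply the sum of the relevant binary flags, so a paper's rank is the number of lists it matched. The pure-Python sketch below shows the idea on made-up flag rows; `rank_counts` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def rank_counts(flag_rows, columns):
    """Return the per-row rank (sum of the chosen flags) and a count of rows per rank."""
    ranks = [sum(row[c] for c in columns) for row in flag_rows]
    return ranks, Counter(ranks)

# Made-up flag rows: 1 means the paper matched that list
flag_rows = [
    {'p_4': 1, 'p_5': 1, 'p_6': 1},
    {'p_4': 1, 'p_5': 0, 'p_6': 1},
    {'p_4': 0, 'p_5': 0, 'p_6': 0},
    {'p_4': 1, 'p_5': 0, 'p_6': 0},
]
ranks, counts = rank_counts(flag_rows, ['p_5', 'p_4', 'p_6'])
print(ranks)
print(dict(counts))
```

In *pandas*, the same per-rank totals can be obtained in one call with `df["rank"].value_counts()`.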

In [None]:
# Show the articles that match all three lists
df[df["rank"] == 3]

In [None]:
## Word cloud of titles for articles matching two of the three lists (drugs, effectiveness, COVID-19)
text_world_cloud=""
for i, row in df[df["rank"] == 2].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 2
wordcloud_draw(text_world_cloud.lower())

## <font color=black>Sub Task 10.2: Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.</font>
***

In [None]:
# Prepare ranking: vaccines + ADE + COVID-19
df = df.assign(rank=df["p_2"] + df["p_19"] + df["p_4"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe()

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))
print(len(df[df["rank"] == 3]))

In [None]:
# Show the articles that match all three lists
df[df["rank"] == 3]

In [None]:
## Word cloud of titles matching vaccines + ADE + COVID-19
text_world_cloud=""
for i, row in df[df["rank"] == 3].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 3
wordcloud_draw(text_world_cloud.lower())

## <font color=black>Sub Task 10.3: Exploration of use of best animal models and their predictive value for a human vaccine.</font>
***

In [None]:
# Prepare ranking: vaccines + animals + human + model + COVID-19
df = df.assign(rank=df["p_2"] + df["p_3"] + df["p_8"] + df["p_9"] + df["p_4"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe()

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))
print(len(df[df["rank"] == 3]))
print(len(df[df["rank"] == 4]))
print(len(df[df["rank"] == 5]))

In [None]:
# Show the articles that match all five lists
df[df["rank"] == 5]

In [None]:
## Word cloud of titles for articles matching four of the five lists (vaccines, animals, human, model, COVID-19)
text_world_cloud=""
for i, row in df[df["rank"] == 4].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 4
wordcloud_draw(text_world_cloud.lower())

## <font color=black>Sub Task 10.4: Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.</font>
***

In [None]:
# Prepare ranking: therapeutics + symptoms + antiviral agent + COVID-19
df = df.assign(rank=df["p_1"] + df["p_7"] + df["p_11"] + df["p_4"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe()

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))
print(len(df[df["rank"] == 3]))
print(len(df[df["rank"] == 4]))

In [None]:
# Show the articles that match all four lists
df[df["rank"] == 4]

In [None]:
## Word cloud of titles matching therapeutics + symptoms + antiviral agent + COVID-19
text_world_cloud=""
for i, row in df[df["rank"] == 4].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 4
wordcloud_draw(text_world_cloud.lower())

## <font color=black>Sub Task 10.5: Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.</font>
***

In [None]:
# Prepare ranking: therapeutics + prioritize + covid 19  
df = df.assign(rank=df["p_1"] + df["p_14"] + df["p_4"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe()

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))
print(len(df[df["rank"] == 3]))

In [None]:
# Show the articles that match all three lists
df[df["rank"] == 3]

In [None]:
## Word cloud of titles matching therapeutics + prioritize + COVID-19
text_world_cloud=""
for i, row in df[df["rank"] == 3].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 3
wordcloud_draw(text_world_cloud.lower())

## <font color=black>Sub Task 10.6: Efforts targeted at a universal coronavirus vaccine.</font>
***

In [None]:
# Prepare ranking: universal + COVID-19 (the vaccines list, p_2, is not part of this sum)
df = df.assign(rank=df["p_13"] + df["p_4"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe()

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))

In [None]:
# Show the articles that match both lists
df[df["rank"] == 2]

In [None]:
## Word cloud of titles matching universal + COVID-19
text_world_cloud=""
for i, row in df[df["rank"] == 2].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 2
wordcloud_draw(text_world_cloud.lower())

## <font color=black>Sub Task 10.7: Efforts to develop animal models and standardize challenge studies.</font>
***

In [None]:
# Prepare ranking: animals + model + challenge + Covid 19 
df = df.assign(rank=df["p_3"] + df["p_9"] + df["p_12"] + df["p_4"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe()

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))
print(len(df[df["rank"] == 3]))
print(len(df[df["rank"] == 4]))

In [None]:
# Show the articles that match all four lists
df[df["rank"] == 4]

In [None]:
## Word cloud of titles matching animals + model + challenge + COVID-19
text_world_cloud=""
for i, row in df[df["rank"] == 4].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 4
wordcloud_draw(text_world_cloud.lower()) 

## <font color=black>Sub Task 10.8: Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers</font>
***

In [None]:
# Prepare ranking: healthcare + ppe + covid 19 
df = df.assign(rank=df["p_16"] + df["p_17"] + df["p_4"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe()

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))
print(len(df[df["rank"] == 3]))

In [None]:
# Show the articles that match all three lists
df[df["rank"] == 3]

In [None]:
## Word cloud of titles matching healthcare + ppe + COVID-19
text_world_cloud=""
for i, row in df[df["rank"] == 3].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 3
wordcloud_draw(text_world_cloud.lower())

## <font color=black>Sub Task 10.9: Approaches to evaluate risk for enhanced disease after vaccination.</font>
***

In [None]:
# Prepare ranking: animals + model + challenge+ covid 19
df = df.assign(rank=df["p_3"] + df["p_9"] + df["p_12"] + df["p_4"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe()

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))
print(len(df[df["rank"] == 3]))
print(len(df[df["rank"] == 4]))

In [None]:
# Show the articles that match all four lists
df[df["rank"] == 4]

In [None]:
## Word cloud of titles matching animals + model + challenge + COVID-19
text_world_cloud=""
for i, row in df[df["rank"] == 4].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 4
wordcloud_draw(text_world_cloud.lower())

## <font color=black>Sub Task 10.10: Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models in conjunction with therapeutics</font>
***

In [None]:
# Prepare ranking: therapeutics + vaccines + animals + COVID-19
df = df.assign(rank=df["p_1"] + df["p_2"] + df["p_3"] + df["p_4"])

In [None]:
# This result should be added to the narrative, it's going to show the % per list matching
df.describe()

In [None]:
# Show total number of articles with matching list 
print(len(df[df["rank"] == 1]))
print(len(df[df["rank"] == 2]))
print(len(df[df["rank"] == 3]))
print(len(df[df["rank"] == 4]))

In [None]:
# Show the articles that match all four lists
df[df["rank"] == 4]

In [None]:
## Word cloud of titles matching therapeutics + vaccines + animals + COVID-19
text_world_cloud=""
for i, row in df[df["rank"] == 4].iterrows(): 
    text_world_cloud = text_world_cloud +" " + str (row["Title"])
#Visualization rank == 4
wordcloud_draw(text_world_cloud.lower())

In [None]:
#Cleaning up after each section to save space
gc.collect()

In [None]:
questions=['Effectiveness of drugs being developed and tried to treat COVID-19 patients.',
          'Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.',
          'Exploration of use of best animal models and their predictive value for a human vaccine.',
          'Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.',
          'Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up.',
          'Efforts targeted at a universal coronavirus vaccine.',
          'Efforts to develop animal models and standardize challenge studies.',
          'Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers.',
          'Approaches to evaluate risk for enhanced disease after vaccination.',
          'Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models.']

## Finding the list of papers related to Covid-19 


In [None]:
def list_entity(path):  # read the list of entities
    df = pd.read_csv(path)
    lists=[]
    for row in df.iterrows():
        lists.append(row[1].values[0])
    return lists  

# search the DataFrame for the COVID-19 keywords
df_covid=df[functools.reduce(lambda a, b: a|b, (df['Text'].str.contains(s) for s in list_entity('../covid-19.csv')))]

 
len(df_covid)
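
The `functools.reduce` expression above combines one boolean mask per keyword with a logical OR, so a paper is kept if its text contains at least one keyword. The sketch below reproduces that logic on plain Python lists. The name `contains_any` and the sample texts are hypothetical, and the sketch uses plain substring tests, whereas `Series.str.contains` interprets each keyword as a regular expression by default.

```python
import functools

def contains_any(texts, keywords):
    """Return one boolean per text: True if the text contains at least one keyword."""
    masks = [[keyword in text for text in texts] for keyword in keywords]
    # OR the per-keyword masks together, element by element
    return functools.reduce(lambda a, b: [x or y for x, y in zip(a, b)], masks)

# Made-up texts, for illustration only
texts = ['SARS-CoV-2 infects airway cells', 'MERS outbreak in 2012', 'A COVID-19 case series']
keywords = ['COVID-19', 'SARS-CoV-2']
mask = contains_any(texts, keywords)
covid_texts = [text for text, keep in zip(texts, mask) if keep]
print(mask)
print(covid_texts)
```

As with the original expression, an empty keyword list would make `reduce` raise a `TypeError`, so the keyword file is assumed to be non-empty.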


## Creating the name entities based on the question


The model is based on Named Entity Recognition (NER); details can be found at http://urszulaczerwinska.github.io/works/egg_ner.

The entities are tagged with spaCy's rule-based phrase matching: https://spacy.io/usage/rule-based-matching.
   
For each question, the entity lists are read from CSV files, and two dictionaries are built: one holding the entity labels and one mapping each label to the path of its CSV file.
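
As a rough illustration of the steps above, the sketch below builds a label-to-path dictionary from a folder of CSV files and then looks for the listed phrases in a sentence. It is only a sketch under stated assumptions: the folder, file names, phrases, and the helpers `load_phrases` and `match_phrases` are hypothetical, and the plain token comparison is a simple stand-in for spaCy's `PhraseMatcher`, not its implementation.

```python
import csv
import os
import tempfile


def load_phrases(csv_path):
    """Read the first column of a CSV file, skipping the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    return [row[0] for row in rows[1:] if row]


def match_phrases(text, phrases_by_label):
    """Return (label, start_token, end_token) for every exact token-sequence match."""
    tokens = text.lower().split()
    matches = []
    for label, phrases in phrases_by_label.items():
        for phrase in phrases:
            words = phrase.lower().split()
            for start in range(len(tokens) - len(words) + 1):
                if tokens[start:start + len(words)] == words:
                    matches.append((label, start, start + len(words)))
    return matches


# Hypothetical folder with one CSV file per entity label.
with tempfile.TemporaryDirectory() as folder:
    example_lists = {"drugs": ["naproxen", "viral inhibitor"], "trials": ["clinical trial"]}
    for label, phrases in example_lists.items():
        file_path = os.path.join(folder, f"{label}.csv")
        with open(file_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["entity"])  # header row
            writer.writerows([phrase] for phrase in phrases)

    # Label -> path, keyed by the file name without its extension.
    paths = {os.path.splitext(name)[0]: os.path.join(folder, name)
             for name in sorted(os.listdir(folder)) if name.endswith(".csv")}
    phrases_by_label = {label: load_phrases(path) for label, path in paths.items()}

matches = match_phrases("A clinical trial of naproxen as a viral inhibitor", phrases_by_label)
print(matches)
```

Each match carries the label and the token span, which is the same information the notebook later collects per rule when it iterates over the matcher output.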

In [None]:
main_ent={'main_1':'drugs','main_2':'antibody','main_3':'animals','main_4':'therapeutics'
          ,'main_5':'therapeutics','main_6':'vaccines','main_7':'animals','main_8': 'healthcare','main_9':'risk', 'main_10': 'vaccines'}

def entity_list(task,main_ent):
    path=f"../subtask_{task}"
    labels=defaultdict(list)
    paths_list=defaultdict(list)
    try:
        file_name=[]
        for file in os.listdir(path):
            if file.endswith(".csv"):
                file_name.append(file)
        main_name=f"main_{task}"
        labels['main'] = [os.path.splitext(name)[0] for name in file_name if os.path.splitext(name)[0]==main_ent[main_name]]
        labels['sides'] = [os.path.splitext(name)[0] for name in file_name if os.path.splitext(name)[0]!=main_ent[main_name]]
        for name in file_name:
            paths_list[os.path.splitext(name)[0]] = os.path.join(path,name)
    except OSError:  # IOError is an alias of OSError in Python 3
        print("The folder does not exist")
        
    return labels,paths_list
    

def list_entity(path):  # read the list of entities (first column of the CSV file)
    df = pd.read_csv(path)
    lists=[]
    for row in df.iterrows():
        lists.append(row[1].values[0])
    return lists  

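def add_entities(labels,paths):  # register every entity list as a set of PhraseMatcher patterns
    for key in labels.keys():
        for val in labels[key]:
            patterns = [nlp(text) for text in list_entity(paths[val])] 
            # spaCy v2 call signature; spaCy v3 expects matcher.add(val, patterns)
            matcher.add(val, None, *patterns) 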

def remove_entities(labels):    # remove the entity lists from the matcher
    for key in labels.keys():
        for val in labels[key]:
            matcher.remove(val) 
    
def check_existance(par,where_ind,doc):  # check whether any entity exists in par; returns a dict whose key is the entity label and whose value is 1 if it exists
    dict_list=defaultdict(list)
    st=LancasterStemmer()
    stem_par=[st.stem(word) for word in word_tokenize(par)]  # stem the paragraph once, not once per match
    for key in where_ind:
        for val in where_ind[key]:
            if st.stem(str(doc[val[0]:val[1]])) in stem_par:
                dict_list[key]=1
    return dict_list  

def prefrom_or(dict_list,labels):  # return 1 if at least one side entity is flagged in dict_list
    exist=0
    for val in labels['sides']:
        if dict_list[val]==1:
            exist=1
    return exist  




def print_title_summary(titel_main,all_sent,publish_time,nlp_question):
    # group the matched sentences by paper title and score each paper against the question
    unique_titles = list(set(titel_main))
    rows=[]
    
    for title in unique_titles:
        indices = [i for i, x in enumerate(titel_main) if x == title]
        text = [all_sent[ind] for ind in indices]
        time = [publish_time[ind] for ind in indices]
        combined_text = ' '.join(text)
        
        score = nlp_question.similarity(nlp(combined_text))
        rows.append({'title':title,'publish_time':time,'text':combined_text,'scores':score})

    # build the frame in one step; DataFrame.append was removed in pandas 2.0
    out_put=pd.DataFrame(rows,columns=['title','publish_time','text','scores'])
    out_put=out_put.sort_values(by=['scores'],ascending=False)
    return out_put
            

The output of `check_existance` is a dict in which each key is a predefined entity and the value is 1 if that entity appears in `par` (a paragraph or sentence).
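
To make the shape of that dictionary concrete, here is a small sketch under stated assumptions. The helper names `toy_stem` and `entity_flags` are hypothetical, the stemmer is a crude stand-in for `LancasterStemmer`, and the example uses `defaultdict(int)` so a missing label reads as 0, whereas the notebook uses `defaultdict(list)`.

```python
from collections import defaultdict


def toy_stem(word):
    """Very rough stand-in for a real stemmer: lowercase and drop a trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def entity_flags(sentence, matched_phrases):
    """Map each entity label to 1 if one of its matched phrases occurs in the sentence."""
    stems = {toy_stem(token.strip(".,;:()")) for token in sentence.split()}
    flags = defaultdict(int)
    for label, phrases in matched_phrases.items():
        if any(toy_stem(phrase) in stems for phrase in phrases):
            flags[label] = 1
    return flags


# Hypothetical phrases that were matched somewhere in the same paper.
matched = {"drugs": ["naproxen"], "vaccines": ["vaccines"], "animals": ["mice"]}
flags = entity_flags("Naproxen was not tested alongside any vaccine.", matched)
print(dict(flags))
```

Because the comparison is made against single stemmed tokens, a multi-word phrase cannot be found this way; the same appears to hold for the token-level check in `check_existance`.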



## Post processing
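
The post-processing cell represents each sentence by the average of its word vectors and clusters the sentences by cosine distance. The sketch below shows those two building blocks in plain Python; the two-dimensional `toy_vectors` and the helper names `average_vector` and `cosine_distance` are illustrative assumptions, not a trained Word2Vec model or the NLTK implementation.

```python
import math


def average_vector(words, vectors):
    """Average the vectors of the known words; return None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


# Toy embedding table, invented for this example.
toy_vectors = {"vaccine": [1.0, 0.0], "trial": [0.0, 1.0], "antibody": [1.0, 1.0]}

sentence_vec = average_vector(["vaccine", "trial", "unknown"], toy_vectors)
print(sentence_vec)
print(cosine_distance(sentence_vec, toy_vectors["antibody"]))
```

Returning `None` for a sentence with no known words makes the empty case explicit, so a caller can skip such sentences before clustering.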

In [None]:

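def sent_vectorizer(sent, model):
    # average the word vectors of a sentence, skipping out-of-vocabulary words
    sent_vec =[]
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model.wv[w]   # model.wv works in gensim 3.x and 4.x
            else:
                sent_vec = np.add(sent_vec, model.wv[w])
            numw += 1
        except KeyError:
            pass
    if numw == 0:   # no known word: avoid dividing by zero
        return np.asarray(sent_vec)
    return np.asarray(sent_vec) / numw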


def sent2words(all_sent):
    sent_as_words = []
    #if all_sent is list():
    for s in all_sent:
        sent_as_words.append(s.split())
            
    #else:
    #    sent_as_words=all_sent.split()
    
    return sent_as_words


def sent_embedding(solution,all_sent):
    if solution== 'bert':
#            all_sent_list=  sent_tokenize(all_sent)      
        client = BertClient()
        embadded_vec = client.encode(all_sent) 
    else:

        sent_as_words = sent2words(all_sent)    
        model = Word2Vec(sent_as_words, min_count=1)
        embadded_vec=[]
        for sentence in sent_as_words:
            embadded_vec.append(sent_vectorizer(sentence, model))   

    return embadded_vec

def cluster_alg(NUM_CLUSTERS,embadded_vec):
    
    kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
    assigned_clusters = kclusterer.cluster(embadded_vec, assign_clusters=True)
    
    return assigned_clusters

def post_processing_bert(all_sent,ratio):
    NUM_CLUSTERS = 3
    embadded_vec = sent_embedding('bert',all_sent)  # embedding the sent based on solution
    assigned_clusters = cluster_alg(NUM_CLUSTERS,embadded_vec)
    #display(HTML('<b>'+'Highlights'+'</b>'))
    summary=cluster_summry(sent2words(all_sent),NUM_CLUSTERS,assigned_clusters,ratio)
    return summary

## Text summarization
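
The summarization cell labels each cluster with its top-ranked keyword, moving down the ranking when an earlier cluster has already used that keyword. A minimal sketch of that selection step follows; the function name `pick_distinct_keywords`, the keyword lists, and the use of lowercasing instead of stemming are assumptions made for illustration.

```python
def pick_distinct_keywords(ranked_keywords_per_cluster, normalize=str.lower):
    """Pick, for each cluster, the best-ranked keyword not already taken by an earlier cluster."""
    used = set()
    chosen = []
    for ranked in ranked_keywords_per_cluster:
        pick = next((kw for kw in ranked if normalize(kw) not in used), None)
        if pick is not None:
            used.add(normalize(pick))
        chosen.append(pick)  # None marks a cluster whose keywords were all taken
    return chosen


# Invented keyword rankings for three clusters.
rankings = [["vaccine", "antibody"], ["Vaccine", "mice"], ["vaccine"]]
print(pick_distinct_keywords(rankings))
```

Using `next` with a default avoids running past the end of a cluster's keyword list when every candidate has already been used.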

In [None]:
#spacy.load('en_core_web_sm')
def cluster_summry(sent_as_words,NUM_CLUSTERS,assigned_clusters,ratio):
    # summarize each cluster and label it with a keyword not used by an earlier cluster
    st=LancasterStemmer()

    rows=[]
    keys_max=[]
    for c in range(NUM_CLUSTERS):
        sent_in_cluster = [' '.join(sent_as_words[j]) for j in range(len(sent_as_words)) if assigned_clusters[j] == c]
        cluster_text = ' '.join(sent_in_cluster)
        if len(cluster_text.split()) > ratio:    # ratio is a word budget, so compare word counts
            summary_par = summarize(cluster_text, word_count = ratio)
        else: 
            summary_par = cluster_text

        # take the highest-ranked keyword whose stem has not been used yet
        candidates = keywords(cluster_text).split('\n')
        keyword_intia = next((k for k in candidates if st.stem(k) not in keys_max), '')
        keys_max.append(st.stem(keyword_intia))
        rows.append({'keyword':keyword_intia,'summary':summary_par})
    
    # build the frame in one step; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows,columns=['keyword','summary'])
                  


In [None]:
import sys

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
    
import random    


def search_task(task, df_covid, main_ent,questions,q):

    all_sent = []
    titles = []
    publish_time=[]
    labels,paths = entity_list(task, main_ent)
    if labels:
        df_reduced=df_covid[functools.reduce(lambda a, b: b|a, (df_covid['Text'].str.contains(s) for s in list_entity(paths[main_ent[f"main_{task}"]])))]
        # to reduce runtime, optionally sample 10% of the rows when there are more than 200
        #if len(df_reduced) > 200:
        #    df_reduced = df_reduced.sample(frac=0.1, replace=True, random_state=1)
        #print(df_reduced.shape)

        pbar = tqdm()
        pbar.reset(total=len(df_reduced)) 
        add_entities(labels,paths)   # add the entity to the exiting model
    
        nlp.max_length = 206000000   # raise the limit before parsing so long papers are accepted
        for row in df_reduced.iterrows():  # go through body of each paper in dataframe
            pbar.update()
            doc = nlp(row[1]['Text'])         
            matches = matcher(doc)    # tag the predefined entities
            rule_id = []
            where_ind = defaultdict(list)
        
            for match_id, start, end in matches:
                rule = nlp.vocab.strings[match_id]
                rule_id.append(rule)  # string label of the matched rule, e.g. 'drugs'
                where_ind[rule].append((start,end))
            exist=0    
            for st in labels['sides']:
                if st in rule_id:
                    exist=1
                
            if labels['main'][0] in rule_id and exist:    # the paper mentions the main entity and at least one side entity
                for par in doc.sents:
                    dict_list = check_existance(par.text,where_ind,doc)
                
                    if dict_list[labels['main'][0]] == 1: 
                    
                        if prefrom_or(dict_list,labels)==1:      # the sentence combines the main entity with a side entity
                            all_sent.append(par.text)  # all sentences
                            titles.append(row[1]['Title'])
                            publish_time.append(row[1]['publish_time'])
                         
                            #display(HTML('<b>'+row[1]['title']+'</b> : <i>'+par.text+'</i>, '))  # print the related part of paper 
                
        #display(HTML('<b>'+questions[task-1]+'</b>' ))                 
        
        if all_sent:
            nlp_question = nlp(questions[task-1])
            score_papers=print_title_summary(titles,all_sent,publish_time,nlp_question)
            #print('csv out',task)
            score_papers.to_csv(output_dir + f"papers_subtask_{task}.csv")
            
            #summary=post_processing_bert(all_sent,100)
            #summary.to_csv(output_dir + f"summary_subtask_{task}.csv")
        else:
            #print('no all sent - task',task)
            score_papers=[]
            #remove_entities(labels)     # remove the existing entities
        
    #q.put((score_papers,task))
    #return(score_papers,task)
     
    res = 'Process worker ' + str(task)   # report the task number, not the queue object
    print("Worker finished job",task)
    q.put(res)
    return res

In [None]:
import multiprocessing as mp
import time

matcher = PhraseMatcher(nlp.vocab)  

def listener(q):
    """Listens for messages on the queue and prints them until a None sentinel arrives."""
    print("start listener")
    while True:
        m = q.get()
        if m is None:
            print("listener got kill message")
            break
        print("listener got message: {}".format(m))

def main():
    #must use Manager queue here, or will not work
    manager = mp.Manager()
    q = manager.Queue()    
    pool = mp.Pool(mp.cpu_count()+2)
    #put listener to work first
    watcher = pool.apply_async(listener, (q,))
    
    pbar = tqdm()
    pbar.reset(total=len(range(1,11))) 
    #fire off workers
    jobs = []

    for task in range(1,11):
        print('processing task', task)
        pbar.update()
        job=pool.apply_async(search_task,(task,df_covid, main_ent,questions,q) )
        jobs.append(job)

    # collect results from the workers through the pool result queue
    for job in jobs:
        #print('Get job -',job)
        job.get()
        
    #now we are done, kill the listener
    q.put(None)
    #q.task_done
    pool.close()
    pool.join()
    
if __name__ == "__main__":
    main()

In [None]:
#nlp = spacy.load("en")
#matcher = PhraseMatcher(nlp.vocab)  
#for task in range(1,11):
#    print('processing task', task)
#    search_task(task,df_covid, main_ent,questions)

In [None]:

def print_output(path):
    papers=pd.read_csv(path)
    df=papers.drop_duplicates()   
    df=df.dropna()
    df=df.drop(['Unnamed: 0'], axis=1)

    # publish_time is stored as a stringified list, e.g. "['2020-03-01']" or "[nan]"
    time=[]
    for i in range(len(df)):
        if df.iloc[i]['publish_time'][1:4]=='nan':
            time.append('nan')
        else:    
            # use iloc: index labels are no longer contiguous after rows were dropped
            time.append(df.iloc[i]['publish_time'][2:12])

    df['publish time']=time
    df=df.drop(['publish_time'], axis=1)
    display(HTML(df.to_html()))

# 5. Notebook Results 

The following graphs show the entity lists used for each question and the number of articles that match the criteria. From those articles, clustering of interrelated keywords produced the text examples for each subtask.

### 5.1 Subtask 10.1:  Effectiveness of drugs being developed and tried to treat COVID-19 patients: Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.


In [None]:
from PIL import Image

In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task1.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_1.csv'):
    print_output(output_dir + "papers_subtask_1.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 

### 5.2 Subtask 10.2:  Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.


In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task2.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_2.csv'):
    print_output(output_dir + "papers_subtask_2.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 


### 5.3 Subtask 10.3:  Exploration of use of best animal models and their predictive value for a human vaccine.


In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task3.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_3.csv'):
    print_output(output_dir + "papers_subtask_3.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 


### 5.4 Subtask 10.4:  Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.


In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task4.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_4.csv'):
    print_output(output_dir + "papers_subtask_4.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 

### 5.5 Subtask 10.5:  Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.


In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task5.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_5.csv'):
    print_output(output_dir + "papers_subtask_5.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 


### 5.6 Subtask 10.6:  Efforts targeted at a universal coronavirus vaccine.


In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task6.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_6.csv'):
    print_output(output_dir + "papers_subtask_6.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 

### 5.7 Subtask 10.7: Efforts to develop animal models and standardize challenge studies


In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task7.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_7.csv'):
    print_output(output_dir + "papers_subtask_7.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 

### 5.8 Subtask 10.8:  Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers


In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task8.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_8.csv'):
    print_output(output_dir + "papers_subtask_8.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 


### 5.9 Subtask 10.9:  Approaches to evaluate risk for enhanced disease after vaccination


In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task9.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_9.csv'):
    print_output(output_dir + "papers_subtask_9.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 

### 5.10 Subtask 10.10:  Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models (in conjunction with therapeutics)


In [None]:
Image.open('/kaggle/input/ericsson-task10-img/T10_Task10.png')

In [None]:
if os.path.exists(output_dir + 'papers_subtask_10.csv'):
    print_output(output_dir + "papers_subtask_10.csv")
else: 
    display(HTML('<b>'+"No related article is found"+'</b>' )) 

# 6. References

(1) https://spacy.io/

(2) https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

(3) https://www.nltk.org/

(4) https://scikit-learn.org/

(5) https://pypi.org/project/gensim/

(6) https://github.com/tqdm/tqdm

(7) https://matplotlib.org/

(8) https://catalog.data.gov/dataset

(9) https://meshb.nlm.nih.gov/

# 7. Acknowledgments


Alice Omana, Regulatory Affairs Pharmacist<br>
https://www.linkedin.com/in/alice-oma%C3%B1a-28659938


Eric Murillo, PhD, Full Professor at School of Medicine in Mexico<br>
https://www.linkedin.com/in/eric-murillo-rodriguez-phd-a7358445