# What has been published about vaccines and therapeutics against COVID-19?

© 2020 Nokia.
Licensed under the BSD 3-Clause License.
SPDX-License-Identifier: BSD-3-Clause.

### Goal
There is a growing urgency to facilitate the assimilation of the rapidly increasing literature on COVID-19 by the scientific community.
We aim to address this need by automatically extracting insights about key medical entities from the provided literature corpus using Information Extraction techniques and Knowledge Graph representations.
Our approach can be applied to any medical entity, as it builds a knowledge graph around the concepts of interest.
However, we submit it as an answer to "Task 4: What do we know about vaccines and therapeutics?" as we apply our methodology to answer the 10 questions of said task.

### Information Extraction Methodology
0. Structure the literature corpus as a dataframe of article sections
1. Keyword search to identify articles that contain entities of interest: COVID-19
2. Information Extraction using Stanford's OpenIE to extract IE triplets: (`subject`, `relation`, `object`)
3. Removing redundant triplets in each article section
4. Generating Triplet Database to answer questions
5. Report creation by building a Knowledge Graph and providing structured insights

### Acknowledgements
* Ivan Ega Pratama. [Dataset Parsing Code | Kaggle, COVID EDA: Initial Exploration Tool](https://www.kaggle.com/ivanegapratama/covid-eda-initial-exploration-tool).
* Manning, Christopher D., Surdeanu, Mihai, Bauer, John, Finkel, Jenny, Bethard, Steven J., and McClosky, David. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60. [Open IE](https://stanfordnlp.github.io/CoreNLP/openie.html)
* jezrael. [Pandas - Check if a value in a column is a substring of another value in the same column](https://stackoverflow.com/a/58951442)
* Ishamael. [Finding subsequence (nonconsecutive)](https://stackoverflow.com/a/29954829)
* Marius Borcan: [A Python implementation of a basic Knowledge Graph](https://github.com/bdmarius/python-knowledge-graph)
* Shahules786: [CORD: Tools and Knowledge graphs🛠️ 🛠️](https://www.kaggle.com/shahules/cord-tools-and-knowledge-graphs)
* Rathachai (used in the external version of the tool only): [D3RDF](https://github.com/Rathachai/d3rdf)

### Challenge Submissions
The present notebook constitutes our submission to the Task "What do we know about vaccines and therapeutics?".

We also present [Kaggle community contributions](https://www.kaggle.com/covid-19-contributions) for each of the three categories:
* **Literature Review**: This notebook outputs answers to the questions in the task in a standard CSV format, as is customary in the Kaggle community
* **Tools**: We offer our search capability that creates a knowledge graph and associated report as an interactive tool: http://covid19search.net/
* **Datasets**: We offer our structured dataset of extracted triplets (`subject`, `relation`, `object`) for each relevant section in the literature corpus: https://www.kaggle.com/enriquemartinlopez/covid19-ie-from-literature-processed-triplets

In [None]:
import pandas as pd

import json
from tqdm.notebook import tqdm

%matplotlib inline

%load_ext autoreload
%autoreload 2

In [None]:
# PARAMETERS

mode = 0 # 0 to run in kaggle, 1 to run locally

corpus_paths = ['/kaggle/input/CORD-19-research-challenge/', '../data/CORD-19-research-challenge/']
results_paths = ['/kaggle/input/covid19-ie-from-literature-precomputed-results/', '../results/']
covid_master_df_filtered_paths = ['/kaggle/input/covid19-ie-from-literature-processed-triplets/', '../results/']

corpus_path = corpus_paths[mode]
results_path = results_paths[mode]
covid_master_df_file = results_path+'covid_master_df.csv/covid_master_df.csv'
covid_master_df_filtered_file = covid_master_df_filtered_paths[mode]+'covid_master_df_filtered.csv'

# SAVING LITERATURE REVIEW TABLES
def save_to_csv(df, filename): 
    foldername = [
        '/kaggle/working/',
        '../results/'
    ]
    foldername = foldername[mode]
    
    df.to_csv(foldername+filename)

# Information Extraction Methodology

## Structure the literature corpus as a dataframe of article sections

We adapt the code from the notebook by Ivan Ega Pratama, from Kaggle.
[Dataset Parsing Code | Kaggle, COVID EDA: Initial Exploration Tool](https://www.kaggle.com/ivanegapratama/covid-eda-initial-exploration-tool)

The `create_dataframes` function below depends on the specific version of the dataset at the time of writing the code, so we recommend simply loading the pre-computed dataframe by running the corresponding (uncommented) cell.

In [None]:
# from create_corpus_dataframe import create_dataframes

# df = create_dataframes(corpus_path, results_path)

In [None]:
# Loading pre-computed dataframe
df = pd.read_csv(results_path+'all_articles_content.csv/all_articles_content.csv', index_col=False)
df.sort_index().head()

We obtain a dataframe that organises the literature corpus with one article **section** per row, storing its **content** and keeping track of both its **title** and **paper_id**.

## Keyword search to identify articles that contain entities of interest: COVID-19

We provide the function `create_entity_df_dict`, which takes a list of entities, together with the dataframe created above, to produce a dictionary that maps each entity to a dataframe of article sections that contain said entity.

The function `combine_df_dict` combines the dataframes of all the entities and returns a single, de-duplicated dataframe with the same columns as the one above.

**This method can be applied to filter the corpus by any list of concepts of interest.**
However, for this notebook submission we will focus on the literature on COVID-19.

In [None]:
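def create_entity_df_dict(entity_list, corpus_df):
    """Map each entity to the rows whose content, section or title mentions it."""
    entity_dfs = {}
    for entity in entity_list:
        # Case-insensitive match; na=False treats missing text as "no match"
        # so that rows with empty fields do not break the boolean mask
        mask = (
            corpus_df['content'].str.lower().str.contains(entity, na=False)
            | corpus_df['section'].str.lower().str.contains(entity, na=False)
            | corpus_df['title'].str.lower().str.contains(entity, na=False)
        )
        entity_dfs[entity] = corpus_df[mask].copy()
        print('{}: {} articles found'.format(entity, len(entity_dfs[entity])))
    return entity_dfs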

def combine_df_dict(entity_dfs):
    combined_df = pd.DataFrame()
    for df in entity_dfs.values():
        combined_df = pd.concat([combined_df, df])
    combined_df.drop_duplicates(inplace=True)
    print('Total articles: {}'.format(len(combined_df)))
    return combined_df

To do so, we will create a list with all the official names of the coronavirus disease (COVID-19) and the virus that causes it.

This list is taken from the WHO website:
https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/naming-the-coronavirus-disease-(covid-2019)-and-the-virus-that-causes-it

At the time of our search, it includes the following terms:

In [None]:
covid_ent_list = ['coronavirus disease', 'covid-19', 'severe acute respiratory syndrome coronavirus 2', 'sars-cov-2']

In [None]:
# Storing the dataframes for each concept of interest
covid_dfs = create_entity_df_dict(covid_ent_list, df)

In [None]:
# Combining the dataframes
covid_combined_df = combine_df_dict(covid_dfs)
covid_combined_df.sort_index().head()

## Information Extraction using Stanford's OpenIE

Information Extraction is a Natural Language Processing task that extracts concept triplets from a given piece of text. Such triplets are formed by two concepts and the relationship between them: (`subject`, `relation`, `object`).

For instance, given "Washington is the capital city of US" we could extract the triplet annotation (`Washington`, `to be the capital city`, `US`).
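
As a minimal illustration of this data structure (a sketch, not part of the pipeline), the triplet above can be held as a dictionary with the `subject`, `relation` and `object` keys used throughout this notebook. The `Triplet` named tuple and the `to_triplet` helper are hypothetical names introduced only for this example.

```python
from collections import namedtuple

# Hypothetical container for illustration; the notebook itself stores plain dicts
Triplet = namedtuple('Triplet', ['subject', 'relation', 'object'])

def to_triplet(triplet_d):
    """Convert a triplet dict into an ordered (subject, relation, object) tuple."""
    return Triplet(triplet_d['subject'], triplet_d['relation'], triplet_d['object'])

# The example sentence from the text, written as a triplet dict
example = {'subject': 'Washington', 'relation': 'to be the capital city', 'object': 'US'}
triplet = to_triplet(example)
print(triplet.subject, '->', triplet.object)
```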

We will use a widespread tool for this task: Stanford's Open IE.
The information for installing a Python wrapper for Stanford's OpenIE can be found [here](https://pypi.org/project/stanford-openie/).

However, it is not possible to install this module in the Kaggle kernel, so for the sake of reproducibility we copy below the corresponding code, which can be run locally to extract all triplet annotations from the different sections in the dataframe.

For this notebook submission you can just load the pre-computed results in the corresponding cell.

In [None]:
# from openie import StanfordOpenIE

# def trim(annotations):
#     for sent in annotations['sentences']:
#         del sent['basicDependencies']
#         del sent['enhancedDependencies']
#         del sent['enhancedPlusPlusDependencies']
#     return annotations

# def create_annotations_list(text_list):
#     annotations_list = []
#     with StanfordOpenIE() as client:
#         for text in tqdm(text_list):
#             if isinstance(text, str):
#                 annotations = client.annotate(text, simple_format=False)
#                 annotations = trim(annotations)
#                 annotations_list.append(annotations)
#             elif isinstance(text, list):
#                 body_annots = []
#                 for paragraph in text:
#                     annotations = client.annotate(paragraph, simple_format=False)
#                     annotations = trim(annotations)
#                     body_annots.append(annotations)
#                 annotations_list.append(body_annots)
#             else:
#                 print('Wrong object passed to Information Extractor')
#     return annotations_list


# def create_covid19_triplets(text_list, covid_combined_df, covid_full_triplet_file):

#     annotations_list = create_annotations_list(text_list)
#     print(len(annotations_list))
#     with open(covid_full_triplet_file, 'w') as f:
#         json.dump(annotations_list, f)

#     # Creating a lite version of the dictionary including just triplets as a column for the dataframe
#     annotations_column = []
#     for row in annotations_list:
#         row_ann_list = []
#         for sentence in row['sentences']:
#             for triplet_d in sentence['openie']:
#                 row_ann_list.append({k: triplet_d[k] for k in ('subject', 'relation', 'object')})
#         annotations_column.append(row_ann_list)
#     # Adding to the dataframe
#     covid_combined_df['triplets'] = annotations_column
#     # Exporting
#     covid_combined_df.to_csv(covid_master_df_file)

#     return covid_combined_df


# df = create_covid19_triplets(covid_combined_df['content'], covid_combined_df, covid_full_triplet_file)

In [None]:
# Loading precomputed dataframe including all triplets for each section of the journal articles
covid_master_df = pd.read_csv(covid_master_df_file)
covid_master_df.head()

## Removing redundant triplets in each article section

Stanford's OpenIE tool attempts to extract all possible triplets present in a piece of text. However, this may lead to situations in which some of the extracted triplets are redundant.
For instance, take the following sentence:
> Several studies have reported the typical lung features of the disease on chest CT.

OpenIE will extract, amongst others, the following triplets for this sentence:
> {
    "subject": "several studies",
    "relation": "have reported",
    "object": "lung features of disease"
  }

> {
    "subject": "studies",
    "relation": "have reported",
    "object": "typical lung features of disease"
  }
  
> {
    "subject": "several studies",
    "relation": "have reported",
    "object": "typical lung features of disease"
  }
  
> {
    "subject": "several studies",
    "relation": "have reported",
    "object": "lung features on chest CT."
  }
  
> {
    "subject": "studies",
    "relation": "have reported",
    "object": "lung features of disease on chest CT."
  }
  
To avoid cluttering our information extraction task, we will apply a simple filter that keeps the longest `relation` between the same `subject` and `object` and, other things being equal, the longest `subject` and `object`.
Therefore, our function to remove "Duplicate" triplets will return just the following ones in this example:
> {'subject': 'several studies',
  'relation': 'have reported',
  'object': 'lung features on chest CT.'}
  
> {'subject': 'several studies',
  'relation': 'have reported',
  'object': 'typical lung features of disease'}
  
> {'subject': 'studies',
  'relation': 'have reported',
  'object': 'lung features of disease on chest CT.'}
  
We acknowledge that this part of the methodology can be improved by applying lemmatization, stemming, and even clustering of concepts.
However, due to time constraints we leave this improvement as future work.
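
The filter described above can be sketched as follows. This is one possible reading of the rule, not the actual `removeDuplicateTriplets` implementation, which is not shown in this notebook. It assumes that a triplet is redundant when some other triplet contains the words of its `subject`, `relation` and `object` in the same order (not necessarily adjacent). The function names are hypothetical.

```python
KEYS = ('subject', 'relation', 'object')

def is_subsequence(short, long):
    """True if the words of `short` appear in `long` in order (gaps allowed)."""
    remaining = iter(long)
    return all(word in remaining for word in short)

def dominates(a, b):
    """True if triplet `a` differs from `b` and covers each of its three parts."""
    return a != b and all(is_subsequence(b[k].split(), a[k].split()) for k in KEYS)

def remove_redundant(triplets):
    """Drop exact duplicates, then drop every triplet dominated by another one."""
    unique = [t for i, t in enumerate(triplets) if t not in triplets[:i]]
    return [t for t in unique if not any(dominates(other, t) for other in unique)]

# The five triplets from the example sentence above
triplets = [
    {'subject': 'several studies', 'relation': 'have reported', 'object': 'lung features of disease'},
    {'subject': 'studies', 'relation': 'have reported', 'object': 'typical lung features of disease'},
    {'subject': 'several studies', 'relation': 'have reported', 'object': 'typical lung features of disease'},
    {'subject': 'several studies', 'relation': 'have reported', 'object': 'lung features on chest CT.'},
    {'subject': 'studies', 'relation': 'have reported', 'object': 'lung features of disease on chest CT.'},
]
kept = remove_redundant(triplets)
for t in kept:
    print(t)
```

Under this assumption, the sketch keeps the same three triplets as listed above. The pairwise comparison is quadratic in the number of triplets per section, which is consistent with the step taking some time on the full corpus.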

We create the `covid` library including the `ArticleFilter` class that we will use to remove redundant triplets and create reports that answer the questions in this task.

In [None]:
import covid as cv

In [None]:
# Create an `ArticleFilter` instance and show some relevant information
af = cv.ArticleFilter(covid_master_df_file)
af.info()

Removing redundant triplets takes some time, so we provide the code below but recommend importing the results using the corresponding cell.

In [None]:
# af.removeDuplicateTriplets()
# af.info()
# af.cdf.to_csv(covid_master_df_filtered_file)

In [None]:
# Importing dataframe with filtered triplets
af = cv.ArticleFilter(covid_master_df_filtered_file)
af.info()

We can inspect this dataframe, which we contribute as a [Kaggle community contribution](https://www.kaggle.com/covid-19-contributions), since we believe it is one of the main outputs of our work and we hope it can be reused by other researchers.

In [None]:
af.cdf.sort_index().head()

## Generating Triplet Database to answer questions

We will finally create a concise dataframe with triplet elements that will be used by the subroutine that filters through the literature.

Generating this will take a minute or two.

In [None]:
af.generateTripletsDB()

Let us inspect the resulting dataframe:

In [None]:
af.triplets.sort_index().head()

## Report creation by building a Knowledge Graph and providing structured insights

We will create a knowledge graph with directed edges from `subject` to `object` that are labelled by the corresponding `relation`.
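
A minimal sketch of this graph structure, using only the standard library, is given below. It is an illustration under our own assumptions and not the implementation inside `ArticleFilter`. The `build_graph` helper is a hypothetical name, and the toy triplets are taken from the hand-picked examples in the Discussion section.

```python
from collections import defaultdict

def build_graph(triplets):
    """Adjacency map: subject -> list of (relation, object) directed, labelled edges."""
    graph = defaultdict(list)
    for t in triplets:
        graph[t['subject']].append((t['relation'], t['object']))
    return dict(graph)

toy_triplets = [
    {'subject': 'antibody-dependent enhancement', 'relation': 'be cause of',
     'object': 'vaccine failure'},
    {'subject': 'animal models', 'relation': 'uncover',
     'object': 'mechanism of viral pathogenicity'},
    {'subject': 'clinical manifestations', 'relation': 'can vary widely between',
     'object': 'animal models'},
]

graph = build_graph(toy_triplets)
for subject, edges in graph.items():
    for relation, obj in edges:
        print('{} --[{}]--> {}'.format(subject, relation, obj))
```

Note that `animal models` appears both as a source and as a target node, which is what allows paths through the graph to be followed once entity names are normalized.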

The user will provide a set of search keywords that are relevant to the question.
These keywords are structured as a dictionary that filters the Covid-19 literature dataframe based on the string content of the `title`, `section`, and `content` columns, and also on the string content of the `subject`, `relation`, and `object` of the corresponding triplets.

We acknowledge more sophisticated approaches to achieve this with greater accuracy, but this serves our purpose of building the methodology end-to-end to then improve certain parts with more sophisticated NLP methods.

The search function performs the following steps:

1) create a new DataFrame object that combines the article sections and triplets, as obtained from the steps outlined in the previous sections.

2) search through the specifically targeted columns of this new DataFrame to find the mentions of the keywords passed on to the function.

3) extract only the contents where the keywords are present to create a graph with relevant information.
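
The steps above can be sketched on a toy table as follows. This is a simplified illustration and not the `af.search` implementation. The `toy_search` function, the `TRIPLET_COLUMNS` constant and the sample rows are hypothetical, and case-insensitive regular-expression matching is an assumption. Keys of the filter dictionary are combined with AND, while `|` inside a value acts as OR.

```python
import pandas as pd

# Toy stand-in for the combined sections-plus-triplets table (step 1); values are made up
flat = pd.DataFrame([
    {'title': 'A vaccine study', 'section': 'Results', 'content': 'Mouse models were challenged.',
     'subject': 'mouse models', 'relation': 'were', 'object': 'challenged'},
    {'title': 'A vaccine study', 'section': 'Results', 'content': 'Mouse models were challenged.',
     'subject': 'titres', 'relation': 'rose after', 'object': 'second dose'},
    {'title': 'Clinical features', 'section': 'History', 'content': 'Transmission was rapid.',
     'subject': 'transmission', 'relation': 'was', 'object': 'rapid'},
])

TRIPLET_COLUMNS = ['subject', 'relation', 'object']

def toy_search(table, search_filter):
    """Keep rows matching every key (AND); each value is a regex, so 'a|b' means OR."""
    mask = pd.Series(True, index=table.index)
    for key, pattern in search_filter.items():
        # The special key 'triplet' looks in any of the three triplet columns (step 2)
        columns = TRIPLET_COLUMNS if key == 'triplet' else [key]
        hit = pd.Series(False, index=table.index)
        for column in columns:
            hit |= table[column].str.contains(pattern, case=False, na=False)
        mask &= hit
    # Only the matching rows are kept to build the graph and the report (step 3)
    return table[mask]

result = toy_search(flat, {'title': 'vaccine', 'triplet': 'animal|mouse'})
print(result[TRIPLET_COLUMNS])
```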

Let us see an example, taken somewhat arbitrarily from a discussion in this challenge: "Epidemiologist available to hand-code records for training sets".
This illustrates the generality of the method. In particular, an epidemiologist [proposes the following question](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/137027#778986) as interesting:

### "Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials"

In [None]:
filter = {'section': 'history', 'triplet':'disease|care|public|prevention|transmission|infection|trials'}
af.search(filter, html=True, max_items=7)

We will now attempt to answer all the questions in this task following the same reporting structure:
* Building a graph visualization based on the selected triplets
* Producing a literature extract using the first items from the dataframe (the `max_items` argument limits the number of HTML items to render)
* Producing a CSV table with the relevant triplets

# Q1/10: Effectiveness of drugs being developed and tried to treat COVID-19 patients.

### Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin and minocycline that may exert effects on viral replication.

In [None]:
filter = {'content': 'replication', 'subject':'naproxen|clarithromycin|minocycline|viral inhibitor'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Effectiveness of drugs being developed and tried to treat COVID-19 patients.csv'
save_to_csv(st_df,filename)
st_df

# Q2/10 Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients

In [None]:
filter = {'content': 'ADE' , 'subject':'antibody-dependent enhancement'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Methods evaluating potential complication of Antibody-Dependent Enhancement in vaccine recipients.csv'
save_to_csv(st_df,filename)
st_df

# Q3/10: Exploration of use of best animal models and their predictive value for a human vaccine.

In [None]:
filter = {'title':'vaccine', 'triplet':'animal|dog|mouse|cat|vamp|bird'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Exploration of use of best animal models and their predictive value for a human vaccine.csv'
save_to_csv(st_df,filename)
st_df

# Q4/10: Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents

In [None]:
filter = {'content': 'therapeutic|antiviral', 'subject':'therapeutic'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Capabilities to discover a therapeutic (not vaccine) for the disease.csv'
save_to_csv(st_df,filename)
st_df

# Q5/10: Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up.

### This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.

In [None]:
filter = {'content': 'therapeutic|antiviral|distribute|production', 'subject':'new therapeutic|scarce therapeutic'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Alternative models to aid decision makers in determining how to prioritize and distribute scarce newly proven therapeutics.csv'
save_to_csv(st_df,filename)
st_df

# Q6/10: Efforts targeted at a universal coronavirus vaccine.

In [None]:
filter = {'content': 'vaccination|vaccine|universal vaccine',\
          'subject':'vaccination|vaccine|universal vaccine'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Efforts targeted at a universal coronavirus vaccine.csv'
save_to_csv(st_df,filename)
st_df

# Q7/10: Efforts to develop animal models and standardize challenge studies

In [None]:
filter = {'content': 'animal model|challenge study|challenge studies|standardize|standardise',\
          'subject':'animal models|challenge study|challenge studies'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Efforts to develop animal models and standardize challenge studies.csv'
save_to_csv(st_df,filename)
st_df

# Q8/10: Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers

In [None]:
filter = {'content': '', 'subject':'healthcare worker|doctor|nurse|front-line|prophylaxis'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers.csv'
save_to_csv(st_df,filename)
st_df

# Q9/10: Approaches to evaluate risk for enhanced disease after vaccination

In [None]:
filter = {'content': 'risk after vaccination|enhanced disease', 'subject':'risk|vaccine|enhanced disease'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Approaches to evaluate risk for enhanced disease after vaccination.csv'
save_to_csv(st_df,filename)
st_df

# Q10/10: Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models in conjunction with therapeutics

In [None]:
filter = {'content': 'vaccine', 'subject':'assay'}
st_df = af.search(filter, html=True)

In [None]:
filename = 'Assays to evaluate vaccine immune response and process development for vaccines.csv'
save_to_csv(st_df,filename)
st_df

# Discussion: Pros and Cons of this approach

We have implemented a general solution that provides an end-to-end methodology to extract information of interest from a text corpus.
By applying it to articles related to Covid-19, our tool can provide any information of interest in a concise, structured manner (Knowledge Graph representation and CSV table with relevant subject-relation-object triplets) given a keyword search by the user.
We apply our solution on the Covid-19 subset of articles to extract information of interest for the Challenge Task "What do we know about vaccines and therapeutics?".

We would like to emphasize the flexibility of our approach, which is based on dividing the corpus of articles into article sections and, within those, extracting and identifying relevant triplets.

Our solution provides:
- A flexible framework for information retrieval from a corpus of scientific articles
- Extraction of triplets per article section, providing full traceability back to the source article(s)/section(s)
- Keyword search based on both title/section/content of the articles and extracted triplets (subject, relation, object)
- Knowledge Graph representation of the results
- Graphical front-end as implemented in our externally hosted tool http://covid19search.net


We also identify some current weaknesses that we can improve on in future revisions:
- Expand the search to articles in the corpus that do not explicitly mention the Covid-19 keywords
- Improve the flexibility of the search by adding more logical operators. At the moment, the different keys of the filter dictionary are combined in an AND fashion, and the values only allow OR operations
- Use NLP techniques such as lemmatization, stemming and even embedding feature representations to normalize triplet entities and obtain better results than plain keyword searches
- Exploit the Knowledge Graph representation by leveraging transitive relations between the nodes. This would require good normalization of the triplet entities, as indicated above

The use of more sophisticated NLP would help combat the main limitation of the current approach: much of the ability to find relevant answers to these questions relies on a skilful selection of the keywords. In cases where the keywords are nouns that are highly specific medical terms (e.g. naproxen, ADE), it is much more straightforward to extract relevant information. However, when the questions are asked around more general nouns (e.g. production, healthcare workers), it is more challenging to find specific results. The sub-graph produced by the search algorithm becomes dense because these general nouns are used in a wide variety of contexts that do not necessarily relate to the question being answered. Moreover, a large number of synonyms for a given keyword can make a full search on the knowledge graph too time-consuming.

Due to these limitations we were not able to find direct answers to the posed questions. Below are some hand-picked examples of triplets related to the questions. Although they do not directly provide the answer, they can be traced back to the corresponding article section in the HTML report or in the provided triplet tables.

#### Q1/10: Effectiveness of drugs being developed and tried to treat COVID-19 patients.

* New broad-spectrum anti-coronavirai
* Is with 
* Minimum side-effects

#### Q2/10 Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.

* Antibody-dependent enhancement  
* be cause of 
* Vaccine failure

#### Q3/10 Exploration of use of best animal models and their predictive value for a human vaccine.

* Clinical manifestations
* Can vary widely between
* Animal models

#### Q4/10 Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.

* Therapeutics
* Specific against
* COVID-19

#### Q5/10 Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.

* New therapeutic measures
* Thus are needed for
* Treatment of ICU patients

#### Q6/10 Efforts targeted at a universal coronavirus vaccine.

* Inactivated vaccine
* Is available for
* Control of canine coronavirus infection

#### Q7/10 Efforts to develop animal models and standardize challenge studies

* Animal models
* Uncover
* Mechanism of viral pathogenicity from entrance to transmission

#### Q8/10 Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers

* Front-line healthcare workers
* Are vulnerable to 
* emotional impact of coronavirus Maunder et.

#### Q9/10 Approaches to evaluate risk for enhanced disease after vaccination

* Vaccines 
* Attenuated
* Parainfluenza virus type 3

#### Q10/10 Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models [in conjunction with therapeutics]

* Synthetic peptide vaccine approach
* Sometimes be more effective than
* Traditional vaccine methods

The authors acknowledge the listed shortcomings of the extractions but believe their approach of information extraction from relevant sections, graph representations and more sophisticated extraction algorithms from the Knowledge Graph can provide a general framework for extracting insights from the corpus of scientific literature.