# This notebook is for annotating COVID-19 Literature #
**Author**: [Remzi Celebi](https://www.maastrichtuniversity.nl/remzi.celebi), ([Vincent Emonet](https://www.maastrichtuniversity.nl/vincent.emonet), [Chang Sun](https://www.maastrichtuniversity.nl/chang.sun)) **Edit date**: 12-June-2020

**Abstract**: This notebook is part of the COVID-19 Biomarker Discovery project. The project aims to establish a rich knowledge graph of potential COVID-19 biomarkers, with supporting evidence mined from the scientific literature and compiled from existing public databases. Because the number of publications on COVID-19 grows every day, we developed this notebook to annotate publications using their titles, authors, DOIs, keywords, and abstracts (full-text annotation is a work in progress).
### Data resource ###
We used the [datasets](https://dimensions.figshare.com/articles/Dimensions_COVID-19_publications_datasets_and_clinical_trials/11961063) from Dimensions, which contain all publications, datasets and clinical trials in Dimensions that are related to COVID-19. Dimensions updates the dataset once every 24 hours, so the latest research can be viewed alongside existing information.
### Tool resources ###
1. Extract basic information from publications: [Semantic Scholar API](https://api.semanticscholar.org/)
2. Annotate the extracted text with ontology terms: [BioPortal](http://data.bioontology.org/)

In [8]:
import os
import json
import requests
import pandas as pd
import urllib.request, urllib.error, urllib.parse # import url request
from rdflib import Dataset, URIRef, Literal, RDF, Namespace # import for RDF knowledge

In [None]:
## Uncomment this line and provide your BioPortal API Key here
#os.environ["BIOPORTAL_APIKEY"] = "MY_APIKEY"

### Notebook parameters

The notebook's [Papermill parameters](https://papermill.readthedocs.io/en/latest/usage-parameterize.html) are defined in the next cell. They can be overridden from outside the notebook when it is run as part of a workflow.

In [9]:
# note: the data URL may be changed by the data host
dataset_url = "https://dimensions.figshare.com/ndownloader/files/22797602"

# Path to the folder for input and output data
data_folder='/notebooks/data'

# Get BioPortal API key from the environment variable to hide the key as secret
# You can find how to get an API key from https://bioportal.bioontology.org/help#Getting_an_API_key
bioportal_apikey=os.environ["BIOPORTAL_APIKEY"]
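
Indexing `os.environ` directly, as the cell above does, raises a bare `KeyError` when the variable is not set. A minimal sketch of a lookup with a clearer message follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or fail with a clear message.

    `require_env` is a hypothetical helper, not part of the original notebook.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it or assign os.environ[name] before running this cell."
        )
    return value
```

With this helper, the assignment could read `bioportal_apikey = require_env("BIOPORTAL_APIKEY")`.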

### 1. Input: load the dataset from Dimensions website ### 

Recent versions of `pandas` can load the data directly from a URL.

In [10]:
### Uncomment the three lines below if you want to download the dataset to the local machine
# data_folder = '/notebooks/data'
# os.system('wget -O ' + data_folder + '/input/dimension.xlsx ' + dataset_url)
# dataset_url = data_folder + '/input/dimension.xlsx'

publications_dataframe = pd.read_excel(dataset_url, sheet_name='Publications') 

In [11]:
# Take a look at dataset
print("This dataset contains %d rows and %d columns. \n" %(len(publications_dataframe),len(publications_dataframe.columns)))
publications_dataframe.head()

This dataset contains 36833 rows and 31 columns. 



Unnamed: 0,Date added,Publication ID,DOI,PMID,PMCID,Title,Abstract,Source title,Source UID,Publisher,...,Research Organizations - standardized,GRID IDs,City of Research organization,Country of Research organization,Funder,UIDs of supporting grants,Times cited,Altmetric,Source Linkout,Dimensions URL
0,2020-05-20,pub.1127722683,10.4060/ca8799fr,,,Atténuer les effets de covid-19 sur le secteur...,,,,Food and Agriculture Organization of the Unite...,...,,,,,,,0,,,https://app.dimensions.ai/details/publication/...
1,2020-05-20,pub.1127724804,10.7238/c.n99.2036,,,'Fake news' y coronavirus (y II): las redes so...,,COMeIN,jour.1366192,Fundacio per la Universitat Oberta de Catalunya,...,Open University of Catalonia,grid.36083.3e,Barcelona,Spain,,,0,,https://doi.org/10.7238/c.n99.2036,https://app.dimensions.ai/details/publication/...
2,2020-05-20,pub.1127722401,10.3934/mbe.2020205,,,Phase-adjusted estimation of the COVID-19 outb...,,Mathematical Biosciences and Engineering,jour.1033105,American Institute of Mathematical Sciences (A...,...,,,,,,,0,,https://doi.org/10.3934/mbe.2020205,https://app.dimensions.ai/details/publication/...
3,2020-05-20,pub.1127715363,10.1136/leader-2020-000273,,,What healthcare leaders need to do to protect ...,,BMJ Leader,jour.1293146,BMJ,...,,,,,National Institute for Health Research,,0,67.0,https://bmjleader.bmj.com/content/leader/early...,https://app.dimensions.ai/details/publication/...
4,2020-05-20,pub.1127717858,10.14336/ad.2020.0402,,,COVID-19 in India: Are Biological and Environm...,,Aging and Disease,jour.1044156,Aging and Disease,...,,,,,,,0,,http://www.aginganddisease.org/EN/article/down...,https://app.dimensions.ai/details/publication/...


### 2. Output: write RDF Triple file to defined output folder ### 

In [12]:
output_data = data_folder + '/output/publications.ttl'
os.makedirs(os.path.dirname(output_data), exist_ok=True)

### 3. Retrieval: get the basic info about the publications ### 

In [13]:
def get_json(url):
    """ Use the Semantic Scholar API (https://api.semanticscholar.org/) to retrieve key 
        information about a paper in JSON format.
    Args:
        url (str): a valid Semantic Scholar API link, following the format
        https://api.semanticscholar.org/v1/paper/ + <paper DOI>.

    Returns:
        paper_info_json (dict): the paper metadata parsed from the JSON returned by Semantic Scholar.
    """
    
    # Read content from the url
    opener = urllib.request.build_opener()
    opener.addheaders = [('Accept', 'application/json')]
    
    # Convert the content to JSON
    paper_info_json = json.loads(opener.open(url).read())
    
    return paper_info_json
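
`get_json` lets any HTTP or network failure propagate to the caller. The sketch below shows one way to build the request URL from a DOI and to return `None` when the request fails. The base URL is the one used later in `main()`; the names `build_paper_url` and `get_json_or_none` and the 30-second timeout are assumptions made for illustration.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Same base URL as in main(); everything else in this block is illustrative.
BASE_API_URL = "https://api.semanticscholar.org/v1/paper/"


def build_paper_url(doi, base_url=BASE_API_URL):
    """Build the lookup URL for a DOI, percent-encoding unsafe characters but keeping '/'."""
    return base_url + urllib.parse.quote(doi.strip(), safe="/")


def get_json_or_none(url, timeout=30):
    """Fetch `url` and parse the JSON body; return None if urllib reports an error."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.URLError as error:  # HTTPError is a subclass of URLError
        print("Request failed for", url, "->", error)
        return None
```

A caller can then skip papers for which `get_json_or_none(build_paper_url(doi))` returns `None` instead of aborting the whole run.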

### 4. Annotation: use Bioportal API to annotate the text from the publications ### 

In [14]:
def bioportal_get_json(API_KEY, param):
    """ Use BioPortal (http://data.bioontology.org/) to annotate the recognizable terms 
        in a piece of text from a publication and return the annotations as JSON.
    Args:
        API_KEY (string): an API key from your BioPortal account page.
        param (string): the path and query string appended to the BioPortal REST URL, 
        following the template: "annotator?text=.....&ontologies=..."

    Returns:
        annotated_bioportal_json (JSON): the annotations of the text in JSON format returned from BioPortal.
    """
    
    REST_URL = "http://data.bioontology.org/"
    url = REST_URL + param
    opener = urllib.request.build_opener()
    opener.addheaders = [('Authorization', 'apikey token=' + API_KEY)]
    annotated_bioportal_json = json.loads(opener.open(url).read())
    
    return annotated_bioportal_json

In [15]:
def get_annotations(API_KEY, paper):
    """ Build the annotation request for a paper and send it to BioPortal 
    Args:
        API_KEY (string): an API key from your BioPortal account page.
        paper (JSON): the content about the paper in JSON returned from Semantic Scholar
                        (return from get_json function); its 'abstract' field is annotated.

    Returns:
        annotation_result (JSON): the annotations of the abstract returned from BioPortal.
    """
        
    # This can be abstract, full text, any sections of the paper
    text_to_annotate = paper['abstract'] 
    
    # You can choose your own suitable/preferable ontologies in "&ontologies= ...."
    additional_parameters = "&ontologies=DOID,GO,MESH,COVID-19,MEDDRA,NDFRT&longest_only=true"
    
    # Annotate the text using BioPortal
    annotation_result = bioportal_get_json(API_KEY, "annotator?text=" + urllib.parse.quote(text_to_annotate) + additional_parameters)
    
    return annotation_result
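
`get_annotations` concatenates the query string by hand and passes `paper['abstract']` straight to `urllib.parse.quote`, which raises a `TypeError` if the abstract is `None`. As a sketch under those assumptions, the query could be assembled with `urllib.parse.urlencode` and guarded against empty text. `build_annotator_query` is a hypothetical name; the parameter names `text`, `ontologies` and `longest_only` are the ones already used in the function above.

```python
import urllib.parse


def build_annotator_query(text, ontologies, longest_only=True):
    """Return the 'annotator?...' path and query for the given text (hypothetical helper)."""
    if not text:
        raise ValueError("No text to annotate (the abstract may be missing).")
    query = urllib.parse.urlencode(
        {
            "text": text,
            "ontologies": ",".join(ontologies),
            "longest_only": "true" if longest_only else "false",
        },
        quote_via=urllib.parse.quote,  # encode spaces as %20, as quote() does above
        safe=",",                      # keep the commas between ontology acronyms
    )
    return "annotator?" + query


print(build_annotator_query("heart failure", ["DOID", "GO"]))
# annotator?text=heart%20failure&ontologies=DOID,GO&longest_only=true
```

The returned string has the same shape as the `param` argument that `bioportal_get_json` expects.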

### 5. Conversion: structure and convert the annotated text to RDF triple file   ### 

In [16]:
def convert_to_dict(text_to_annotate, annotated_results):
    """ Select fields from the annotated results (JSON) and collect them in a dictionary. 
    Args:
        text_to_annotate (string): the text that was annotated.
        annotated_results (JSON): the annotated results returned from BioPortal.

    Returns:
        annot_dict (dict): maps each concept IRI to [mention, context, concept_id, ontology].
    """
        
    annot_dict = {}
    for each_annotated_item in annotated_results:
        class_details = each_annotated_item["annotatedClass"]
        for element_in_annotation in each_annotated_item["annotations"]:
            
            # Get elements from the annotated results
            from_ = element_in_annotation["from"]
            to_ = element_in_annotation["to"]
            match_type = element_in_annotation["matchType"]
            mention = element_in_annotation["text"]
            
            # Get text and annotated class info
            context = text_to_annotate
            concept_id = class_details["@id"]
            ontology = class_details["links"]["ontology"]
            
            # Compose them to a dictionary
            annot_dict[concept_id] = [mention, context, concept_id, ontology]
        
    return annot_dict
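
Because `annot_dict` is keyed by concept IRI, a concept that is matched several times in the text keeps only its last mention, and the `from`/`to` offsets are read but not stored. If every match is needed, a variant along the following lines could be used. `collect_mentions` is a hypothetical helper, and the sample input is hand-written in the shape that `convert_to_dict` reads; it is not real BioPortal output.

```python
from collections import defaultdict


def collect_mentions(annotated_results):
    """Group every matched span by concept IRI instead of keeping only the last one."""
    mentions = defaultdict(list)
    for item in annotated_results:
        concept_id = item["annotatedClass"]["@id"]
        for annotation in item["annotations"]:
            mentions[concept_id].append(
                (annotation["text"], annotation["from"], annotation["to"])
            )
    return dict(mentions)


# Hand-written sample with made-up IRIs, only to illustrate the expected structure.
sample = [
    {
        "annotatedClass": {
            "@id": "http://example.org/concept/FEVER",
            "links": {"ontology": "http://example.org/ontology/DEMO"},
        },
        "annotations": [
            {"from": 1, "to": 5, "matchType": "PREF", "text": "FEVER"},
            {"from": 20, "to": 24, "matchType": "PREF", "text": "FEVER"},
        ],
    }
]

print(collect_mentions(sample))
# {'http://example.org/concept/FEVER': [('FEVER', 1, 5), ('FEVER', 20, 24)]}
```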

In [17]:
def convert_paper_to_rdf(predicate_to_uri, paper, annotated_results):
    """ Convert a paper and its annotations to RDF format  
    Args:
        predicate_to_uri (dict): mapping from field names ('abstract', 'doi', 'topic', 
                        'hasAnnotation') to the predicate URIs used in the triples.
        paper (JSON): the content about the paper in JSON returned from Semantic Scholar.
                        (return from get_json function)
        annotated_results (dict): the dictionary including some elements from the annotated results. 
                        (return from convert_to_dict function)

    Returns:
        dataset (rdflib Dataset): the triples in RDF format.
    """

    # Define prefix
    DOI = Namespace("https://doi.org/")
    SORG = Namespace("https://schema.org/")
    RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")
    # SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
    # RDF = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")
    # COVID_INST = Namespace("http://www.w3id.org/covidkg/Instances/")
    
    # Use DOI as paper identifier
    paper_uri = DOI[paper['doi']]
    # Initialize Dataset from rdflib package
    dataset = Dataset() 
    
    for each_annotated_element in annotated_results:
        # Get annotated label and uri
        annot_name = annotated_results[each_annotated_element][0]
        annot_uri = URIRef(each_annotated_element)
        # Add label and uri to the dataset
        dataset.add((paper_uri, URIRef(predicate_to_uri['hasAnnotation']), annot_uri))
        dataset.add((annot_uri,  RDFS['label'], Literal(annot_name)))
        
    for each_paper_element in paper:
        # Get author information and add to dataset
        if each_paper_element == 'authors':
            for author in paper[each_paper_element]:
                author_uri = URIRef(author['url'])
                dataset.add((paper_uri, SORG['author'], author_uri))
                if 'name' in author:
                    dataset.add((author_uri, SORG['name'], Literal(author['name'])))
        # get abstract information and add to dataset
        elif each_paper_element == 'abstract':
            dataset.add((paper_uri, URIRef(predicate_to_uri[each_paper_element]), Literal(paper[each_paper_element]) ))
        # get DOI information and add to dataset
        elif each_paper_element == 'doi':
            dataset.add((paper_uri, URIRef(predicate_to_uri[each_paper_element]), Literal(paper[each_paper_element]) ))
        # get keywords information and add to dataset
        elif each_paper_element == 'topics':
            for topic in paper[each_paper_element]:
                topic_uri = URIRef(topic['url'])
                dataset.add((paper_uri, URIRef(predicate_to_uri['topic']), topic_uri))
                if 'topic' in topic:
                    dataset.add((topic_uri, RDFS['label'], Literal(topic['topic'])))
                        
    return dataset

### 6. Main execution ###

In [18]:
def main():
    
    # Give paper doi
    doi = '10.1691/ph.2020.0431'
    # Semanticscholar API base url 
    BASE_API_URL = 'https://api.semanticscholar.org/v1/paper/'
    
    # Select schema and vocabularies for the publications
    predicate_to_uri= {'abstract':'https://schema.org/backstory',
                       'title':'https://schema.org/headline',
                       'doi':'https://schema.org/identifier',
                       'topic':'https://schema.org/keywords',
                       'hasAnnotation':'https://semanticscience.org/resource/SIO_000255'}
    
    try:
        paper = get_json(BASE_API_URL + doi)
        text_to_annotate = paper['abstract']
        result = get_annotations(bioportal_apikey, paper)  
        annot_dict = convert_to_dict(text_to_annotate, result)
        dataset = convert_paper_to_rdf(predicate_to_uri, paper, annot_dict)
        dataset.serialize(output_data, format='turtle')
    except Exception:
        print('Error while processing', doi)
        raise

In [19]:
if __name__ == "__main__":
    main()
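
`main()` processes a single hard-coded DOI and re-raises on the first error. To run the same steps over many DOIs, one option is a small driver that records failures and continues. This is only a sketch: `process_all` is a hypothetical helper, and `process_one` stands for any callable that performs the retrieval, annotation and conversion steps for one DOI.

```python
def process_all(dois, process_one):
    """Call `process_one(doi)` for each DOI; record failures instead of stopping."""
    succeeded, failed = [], {}
    for doi in dois:
        try:
            process_one(doi)
            succeeded.append(doi)
        except Exception as error:  # keep going and report the error afterwards
            failed[doi] = repr(error)
    return succeeded, failed
```

The DOIs could, for example, be taken from the `DOI` column of `publications_dataframe` after dropping missing values with `dropna()`.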