# Analysis of the IDP Knowledge Graph

__Authors:__  
Alasdair J G Gray ([ORCID:0000-0002-5711-4872](http://orcid.org/0000-0002-5711-4872)), _Heriot-Watt University, Edinburgh, UK_

Petros Papadopoulos ([ORCID:0000-0002-8110-7576](https://orcid.org/0000-0002-8110-7576)), _Heriot-Watt University, Edinburgh, UK_

Ivan Mičetić ([ORCID:0000-0003-1691-8425](https://orcid.org/0000-0003-1691-8425)), _University of Padua, Italy_

Andras Hatos ([ORCID:0000-0001-9224-9820](https://orcid.org/0000-0001-9224-9820)), _University of Padua, Italy_

Imran Asif ([ORCID:0000-0002-1144-6265](https://orcid.org/0000-0002-1144-6265)), _Heriot-Watt University, Edinburgh, UK_


__License:__ Apache 2.0

__Acknowledgements:__ This work was funded as part of the [ELIXIR Interoperability Platform](https://elixir-europe.org/platforms/interoperability) Strategic Implementation Study [Exploiting Bioschemas Markup to Support ELIXIR Communities](https://elixir-europe.org/about-us/commissioned-services/exploiting-bioschemas-markup-support-elixir-communities). This notebook builds upon the work conducted during the Virtual BioHackathon-Europe 2020 reported in [BioHackrXiv](https://biohackrxiv.org/v3jct/).

## Introduction

In this notebook we provide a user interface for querying the IDP-KG with a set of predefined queries. The IDP knowledge graph was constructed from Bioschemas markup embedded in DisProt, MobiDB, and the Protein Ensemble Database (PED). The markup was harvested using the Bioschemas Markup Scraper and Extractor and converted into a knowledge graph using the process in this [notebook](https://github.com/AlasdairGray/IDPcentral/blob/main/notebooks/ETLProcess.ipynb).

The available queries can be grouped as discussed in the following subsections.

### HCLS Dataset Statistics

The W3C Health Care and Life Sciences Interest Group [guidelines for dataset descriptions](https://www.w3.org/TR/hcls-dataset/) provide a set of queries that yield useful statistics for understanding a dataset. The queries range from basic statistics, such as the number of triples, classes, and objects, to more complex ones, such as the number of typed objects linked to a property.

All 14 queries are available through the interface below:
1. Number of triples
1. Number of unique typed entities
1. Number of unique subjects
1. Number of unique properties
1. Number of unique objects
1. Number of unique classes
1. Number of unique literals
1. Number of graphs
1. Number of instances per class
1. Number of occurrences of each property
1. Number of unique typed subjects and triples linked with each property
1. Number of unique typed objects linked with each property
1. Triples and number of unique literals related with each property
1. Number of unique subject types that are linked to unique object types
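
To make concrete what the simpler statistics measure, the following minimal sketch computes a few of them in plain Python over a tiny, made-up list of triples. The data and the `basic_stats` helper are illustrative assumptions; the notebook itself obtains these numbers with the SPARQL queries in the query directory.

```python
from collections import Counter

# Hypothetical (subject, property, object) triples; not taken from IDP-KG.
RDF_TYPE = "rdf:type"
triples = [
    ("ex:P1", RDF_TYPE, "schema:Protein"),
    ("ex:P1", "schema:name", '"Protein one"'),
    ("ex:P2", RDF_TYPE, "schema:Protein"),
    ("ex:A1", RDF_TYPE, "schema:SequenceAnnotation"),
    ("ex:P1", "schema:hasSequenceAnnotation", "ex:A1"),
]

def basic_stats(triples):
    """Count triples, distinct subjects, properties and classes,
    plus the number of instances per class."""
    typed = [(s, o) for s, p, o in triples if p == RDF_TYPE]
    return {
        "triples": len(triples),
        "subjects": len({s for s, _, _ in triples}),
        "properties": len({p for _, p, _ in triples}),
        "classes": len({cls for _, cls in typed}),
        "instances_per_class": dict(Counter(cls for _, cls in typed)),
    }

stats = basic_stats(triples)
print(stats)
```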

### IDP-KG Analysis Queries
The HCLS statistics provide generic dataset statistics. We will now focus on information about the data content that is of interest to the IDP community.

#### Analysis of Proteins
The following queries focus on the proteins in the IDP-KG:
1. Number of distinct proteins: note that the same protein can be present in multiple named graphs due to the overlap of the data sources.
1. Number of proteins per dataset
1. Number of datasets, and the list of those datasets, associated with each protein
1. Venn analysis of proteins from the three data sources; query calculates the number of proteins in each of the different intersections of the three data sources
1. Number of source pages for each protein; in PED the same protein can be found on multiple pages
1. Minimal protein information; retrieves some basic information about each protein
1. Full protein information; retrieves all properties associated with each protein
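
The Venn analysis can be pictured with ordinary Python sets. The sketch below is a minimal illustration on made-up identifiers; `venn_counts` is a hypothetical helper, and it assumes that each protein is counted once, in the region for exactly the sources that contain it. The actual SPARQL query may define its intersections differently.

```python
# Hypothetical protein identifiers per data source; not real IDP-KG content.
sources = {
    "DisProt": {"P1", "P2", "P3"},
    "MobiDB": {"P2", "P3", "P4"},
    "PED": {"P3", "P5"},
}

def venn_counts(sources):
    """Map each combination of sources to the number of proteins found in
    exactly those sources and no others."""
    membership = {}
    for name, proteins in sources.items():
        for protein in proteins:
            membership.setdefault(protein, set()).add(name)
    counts = {}
    for names in membership.values():
        key = tuple(sorted(names))
        counts[key] = counts.get(key, 0) + 1
    return counts

counts = venn_counts(sources)
for combo, n in sorted(counts.items()):
    print(" & ".join(combo), n)
```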


#### Analysis of SequenceAnnotations
The following queries focus on the `SequenceAnnotation` type in the IDP-KG:
1. Number of sequence annotations per dataset
1. Number of sequence annotations that come from multiple datasets; note that as sequence annotations are not merged we do not expect any answers to this query
1. Number of sequence annotations that come from multiple pages; note that as sequence annotations are not merged we do not expect any answers to this query
1. Sequence annotation information; return properties associated with a sequence annotation
1. Number of scholarly articles associated with each sequence annotation
1. Number of sequence annotations associated with each scholarly article
1. Number of sequence annotations linked with each IDP Ontology term code

#### Queries connecting Proteins with their annotations
The following queries follow links between proteins and their annotations:
1. Number of annotations by protein where the annotations come from different datasets
1. Number of annotations by protein where the annotations come from different pages; note that PED can have the same protein appearing on multiple pages
1. Protein/SequenceAnnotation information query

## Query Execution

The cells in this section set up the querying infrastructure (RDFlib or an external SPARQL endpoint) and load the queries, which are read in from the [query directory](https://github.com/AlasdairGray/IDPcentral/tree/main/queries).

They then present two user input widgets that let the user select the data to query and the query to execute. The data that can be queried are:
- Supplying your own endpoint location (recommended if you want to query the full IDP-KG);
- Using one of the IDP-KGs from the repository: test-8, sample-25, and full scrape. These will be executed using the Python RDFlib SPARQL processor, which is not very efficient for larger datasets;
- Using the file generated with the [ETLProcess notebook](https://github.com/AlasdairGray/IDPcentral/blob/main/notebooks/ETLProcess.ipynb).

### Library Imports

The notebook makes use of a number of libraries to execute the queries and display the results back.

In [None]:
# Import and configure logging library
from datetime import datetime
import logging
logging.basicConfig(
    filename='idpQuery.log', 
    filemode='w', 
    format='%(levelname)s:%(message)s', 
    level=logging.INFO)
logging.info('Starting processing at %s' % datetime.now().time())

In [None]:
# Imports for the UI
import ipywidgets as widgets
from ipywidgets import Layout
# Import the display helpers from IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML, Javascript, clear_output
import clipboard
from SPARQLWrapper import SPARQLWrapper, JSON, XML
import json
import glob
import html

In [None]:
# Imports from RDFlib
import rdflib
from rdflib import ConjunctiveGraph, plugin
from rdflib.serializer import Serializer

In [None]:
%%html
<link rel="stylesheet" href="../css/custom.css">

### Result Display Function

The following function takes the results of a `SPARQL SELECT` query and displays them using an HTML table for human viewing.

In [None]:
md_format = ''
def displayResults(queryResult):
    HTMLResult = '<div class="full-width"><p>Number of results: ' + str(len(queryResult['results']['bindings'])) + '</p>'
    HTMLResult = HTMLResult + '<table class="table"><tr>'
    
    global md_format
    md_format = ''
    # print variable names and build header:
    for varName in queryResult['head']['vars']:
        HTMLResult = HTMLResult + '<th>' + varName + '</th>'
        md_format = md_format + ' | ' + varName
    HTMLResult = HTMLResult + '</tr>'
    
    md_format = md_format + ' | \n'
    # Choose each column's alignment from the first row of results
    for row in queryResult['results']['bindings']:
        for column in queryResult['head']['vars']:
            # Unbound variables are omitted from a SPARQL JSON binding
            strVal = str(row[column]['value']) if column in row else ''
            if not strVal.isnumeric():
                md_format = md_format + ' | :---'
            else:
                md_format = md_format + ' | ---:'
        break
                    
    md_format = md_format + ' | \n'
    # print values from each row and build table of results
    for row in queryResult['results']['bindings']:
        HTMLResult = HTMLResult + '<tr>' 
        for column in queryResult['head']['vars']:
            # Unbound variables are omitted from a SPARQL JSON binding
            if column in row:
                strVal = str(row[column]['value'])
                align = "" if not strVal.isnumeric() else "text-align: right;padding-right:1%;"
                value = strVal if not strVal.isnumeric() else "{:,}".format(int(strVal)) 
                HTMLResult = HTMLResult + '<td style="'+align+'">' +  value + '</td>'
                md_format = md_format + ' | ' + value
            else:
                HTMLResult = HTMLResult + '<td>' + "N/A"+ '</td>'
                md_format = md_format + ' | N/A' 
        HTMLResult = HTMLResult + '</tr>'
        md_format = md_format + ' | \n'
        
    HTMLResult = HTMLResult + '</table></div>'
    return HTMLResult
    #display(HTML(HTMLResult))
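
`displayResults` reads the dictionary layout of the SPARQL 1.1 Query Results JSON Format: the variable names sit under `head.vars` and each row is a dictionary under `results.bindings`. The following minimal sketch shows that layout with made-up values, together with a reduced, Markdown-only renderer. `to_markdown` and the sample data are illustrative assumptions and are not part of the notebook.

```python
# Made-up SELECT result in the SPARQL JSON results layout.
sample = {
    "head": {"vars": ["dataset", "proteins"]},
    "results": {"bindings": [
        {"dataset": {"type": "literal", "value": "Example A"},
         "proteins": {"type": "literal", "value": "1200"}},
        # A variable left unbound by the query is simply absent from the row.
        {"dataset": {"type": "literal", "value": "Example B"}},
    ]},
}

def to_markdown(result):
    """Render SELECT results as a Markdown table, right-aligning columns
    whose first-row value is numeric."""
    cols = result["head"]["vars"]
    rows = result["results"]["bindings"]

    def cell(row, col):
        return row[col]["value"] if col in row else "N/A"

    first = rows[0] if rows else {}
    lines = ["| " + " | ".join(cols) + " |"]
    lines.append("| " + " | ".join(
        "---:" if cell(first, c).isnumeric() else ":---" for c in cols) + " |")
    for row in rows:
        values = [cell(row, c) for c in cols]
        lines.append("| " + " | ".join(
            "{:,}".format(int(v)) if v.isnumeric() else v for v in values) + " |")
    return "\n".join(lines)

table = to_markdown(sample)
print(table)
```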

### Loading and Querying IDP-KG

The following code reads in the data selected by the user, or configures the SPARQL endpoint to use. It also orders the queries that can be chosen, with those that correspond to the HCLS statistics queries first, followed by the IDP-KG analysis queries.
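
The cell below takes the dropdown label for each query from the first line of its `.rq` file whenever that line contains a `#`. That step can be sketched in isolation as follows; `query_title` and the sample query text are illustrative and are not part of the notebook.

```python
def query_title(query_text):
    """Return the first line with '#' characters removed if it contains
    a '#', otherwise an empty string."""
    first_line = query_text.partition("\n")[0]
    if "#" in first_line:
        return first_line.replace("#", "").strip()
    return ""

example = "# Number of triples\nSELECT (COUNT(*) AS ?triples)\nWHERE { ?s ?p ?o }"
title = query_title(example)
untitled = query_title("SELECT * WHERE { ?s ?p ?o }")
print(title)
print(repr(untitled))
```

Because the check is a plain substring test, a first line such as a `PREFIX` declaration whose IRI contains `#` would also be treated as a title, so this approach relies on every query file starting with a comment line.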

In [None]:
idpKG = None
opt = ''  #selection option
query_options = [] # use in dropdown list

queryOrder = { #All queries must be entered here.
			  1: 'hcls-stats/number-triples.rq', 
     		  2: 'hcls-stats/typed-entities.rq', 
			  3: 'hcls-stats/number-subjects.rq', 
			  4: 'hcls-stats/number-properties.rq', 
			  5: 'hcls-stats/number-objects.rq',
			  6: 'hcls-stats/number-classes.rq',
			  7: 'hcls-stats/number-literals.rq',
			  8: 'hcls-stats/number-graphs.rq',
			  9: 'hcls-stats/class-count.rq',
			  10: 'hcls-stats/properties-ccurence.rq',
			  11: 'hcls-stats/property-subjects-triples.rq',
			  12: 'hcls-stats/number-typed-objects-linked-property.rq',
			  13: 'hcls-stats/triples-literals-related-property.rq',
			  14: 'hcls-stats/number-subject-types-object-types.rq',
			  15: 'proteins/protein-count.rq',
			  16: 'proteins/protein-per-dataset.rq',
			  17: 'proteins/protein-multi-datasets.rq',
			  18: 'proteins/proteins-by-dataset-groupings.rq',
			  19: 'proteins/protein-multi-pages.rq',
			  20: 'proteins/protein-information-minimal.rq',
		  	  21: 'proteins/protein-information-all.rq',
			  22: 'annotations/annotation-per-dataset.rq',
			  23: 'annotations/annotations-multi-datasets.rq',
			  24: 'annotations/annotations-multi-pages.rq',
			  25: 'annotations/annotation-details.rq',
			  26: 'annotations/annotation-scholarly-articles.rq',
			  27: 'annotations/annotations-per-article.rq',
			  30: 'annotations/annotations-by-term-code.rq',
			  31: 'annotations/protein-annotations-multi-datasets.rq',
		      32: 'annotations/protein-annotation-count.rq',
		      33: 'annotations/list-annotations.rq'
			 }

# To change the execution order, renumber the keys above; the next line sorts the queries by key
queryOrder = {k: queryOrder[k] for k in sorted(queryOrder)}

query_options.append(('-- Select Query --', 'select'))
query_options.append(('All Queries', 'all'))

for key in queryOrder:
    text = '' #queryOrder[key].split("/")[1]
    with open('../queries/'+queryOrder[key]) as f:
        query = f.read()
        if '#' in query.partition('\n')[0]:
            text = query.partition('\n')[0].replace('#','').strip()
            
    query_options.append((text, queryOrder[key]))
    
# Point the global graph variable at a SPARQL endpoint or load a local file into memory
def set_variable(loadingOpt, endpoint):
    global idpKG
    if loadingOpt == 'sparql':
        idpKG = SPARQLWrapper(endpoint)
        idpKG.method = 'POST'
        idpKG.setReturnFormat(JSON)
        logging.info("SPARQL Endpoint: %s" % endpoint)
    else:
        idpKG = ConjunctiveGraph()
        idpKG.parse(endpoint, format="nquads")
        #idpKG.serialize(format="json-ld") 
        logging.info("\tIDP-KG has %s statements." % len(idpKG))

def query_idpkg(query, loadingOpt):
    if loadingOpt == 'sparql':
        idpKG.setQuery(query)
        results = idpKG.queryAndConvert()
        #ToDo: add log message here giving number of results
        logging.info("Number of Results: %s" % str(len(results['results']['bindings'])))
        return results
    else:
        results = idpKG.query(query)
        results = json.loads(results.serialize(format="json"))
        logging.info("Number of Results: %s" % str(len(results['results']['bindings'])))
        return results

def copied_md_format(e):
    clipboard.copy(md_format)
    print('copied')

def display_query_results(query, results, query_title, queryFile):
    btnCopy = widgets.Button(
        description='Copy .md',
        disabled=False,
        button_style='', # 'success', 'info', 'warning', 'danger' or ''
        icon='download',
        layout=Layout(left='79%')
    )
    btnCopy.on_click(copied_md_format)
    accordion = widgets.Accordion(children=[widgets.HTML(value=query), widgets.VBox([btnCopy, widgets.HTML(value=results)])], 
                              selected_index=1,
                              layout=Layout(width='97%'))
    accordion.set_title(0, 'Query: ' + query_title + ' ('+'File: /queries/' + queryFile+')')
    accordion.set_title(1, 'Results')
    display(accordion)

query_editor =  widgets.Textarea(
                value='',
                disabled=False,
                rows=10, 
                layout=Layout(width='90%')
            )
    
def getQueryText(queryFile):
    with open('../queries/'+queryFile) as f:
        query = f.read()
        query_editor.value = query #html.escape(query)

def runQuery(query, queryFile):
        first_line = ''
        
        if query == 'all':
            with open('../queries/'+queryFile) as f:
                query = f.read()
                first_line = query.partition('\n')[0]

                if '#' in first_line:
                    query = query.split("\n",1)[1]
                else:
                    first_line = ''
        
        
        display(HTML('<hr />'))
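        # Cap long queries at 40% of the viewport height; short ones size to their content.
        # A bare line count is not a valid CSS height, so 'auto' is used instead.
        escaped_query = html.escape(query)
        height = '40vh' if len(escaped_query.splitlines()) >= 40 else 'auto'
        display_query = '<div style="height:'+height+';overflow-x:auto;overflow-y:auto;"><pre>'+escaped_query+'</pre></div>'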
        logging.info('Executing query in file: /queries/' + queryFile)
        logging.debug('Query:\n' + query)
        try:
            HTMLResult = displayResults(query_idpkg(query, opt))
            display_query_results(display_query, HTMLResult, first_line.replace('#',''), queryFile)
        except Exception as e:
            logging.error('Exception running query: ' + str(e) + '\nQuery:\n ' + query + '\noptions: ' + opt)
            display_query_results(display_query, 'Exception: ' + str(e), first_line.replace('#',''), queryFile)
            
##########################################################################################################
#Create Selection GUI
rdo1 = widgets.RadioButtons(
    options=['SPARQL Endpoint:', 'Test-8', 'Sample-25', 'IDPKG-Full.nq', 'IDPKG.nq'],
    #     value='pineapple',
    #description='Pizza topping:',
    name = 'select',
    disabled=False,
    layout=Layout(width='20%')
)
    
txt = widgets.Text(
    value='https://swel.macs.hw.ac.uk/data/repositories/idpkg',
    placeholder='Enter endpoint',
    disabled=False,
    layout=Layout(width='80%', height='30px')
)

dropdown = widgets.Dropdown(
            options=query_options,
            value='select'
        )

output = widgets.Output()

def on_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        hideMessage()
        output.clear_output()
        if 'select' != change['new']:
            if 'all' != change['new']:
                query_editor.layout.display = ''
                getQueryText(change['new'])
            else:
                query_editor.value = ''
                query_editor.layout.display = 'none'
        else:
            query_editor.value = ''
            
dropdown.observe(on_change)

msgHTML = widgets.HTML(value = '')

btn = widgets.Button(
    description='Execute',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Execute',
    icon='check',
    #layout=Layout(flex='0 1 auto', width='100px', align = "flex-end")
)

def showMessage(msg):
    msgHTML.value = f'''<div class="alert">
                          <strong>Error:</strong> {msg}
                     </div>'''
def hideMessage():
    msgHTML.value = ''

def loading():
    btn.icon = 'hourglass-1'
    btn.description = 'Busy'
    btn.disabled = True
    
def endLoading():
    btn.icon = 'check'
    btn.description = 'Execute'
    btn.disabled = False

def createSelectionGUI():
    btn.on_click(on_button_clicked)
    display(widgets.HTML(value='<h2>Select a source</h2><p>Use the IDPKG.nq option to load the file generated by the ETLProcess notebook.</p>'),widgets.HBox([rdo1, txt]), widgets.HTML(value='<hr />'), widgets.HTML(value='<h2>Select a query</h2><p>Use the dropdown list to select a query and then click the Execute button to run the query.</p><p>Choosing All Queries will run each query in turn and print the results.</p><p>You may edit the query text in the box below before executing the query.</p>'),widgets.VBox([dropdown, query_editor]), widgets.VBox([btn, msgHTML]), output)

def on_button_clicked(e):
    loading()
    hideMessage()
    with output:
        global opt
        if 'sparql' in rdo1.value.lower():
            if txt.value == '':
                #display(HTML('<span style="color:red">Please enter SPARQL endpoint.</span>'))
                showMessage('Please enter SPARQL endpoint.')
                opt = ''
            else:
                set_variable('sparql', txt.value)
                opt = 'sparql'
        else:
            nqFile = ''
            if 'test-8' in rdo1.value.lower():
                nqFile = 'IDPKG-Sample8.nq'
            elif 'sample-25' in rdo1.value.lower():
                nqFile = 'IDPKG-Sample25.nq'
            elif 'idpkg-full' in rdo1.value.lower():
                nqFile = 'IDPKG-Full.nq'
            elif 'idpkg' in rdo1.value.lower():
                nqFile = 'IDPKG.nq'    
            
            set_variable('local', nqFile)
            opt = 'local'
            
        clear_output(True)
        
        #Execute query
        
        if opt != '':
            if dropdown.value != 'select':
                if dropdown.value == 'all':
                    for key in queryOrder:
                        logging.info('Run query %d' % key)
                        runQuery('all', queryOrder[key])
                else:
                    if query_editor.value != '':
                        logging.info('Run query %s' % dropdown.value)
                        sparql_query = query_editor.value
                        runQuery(sparql_query, dropdown.value)
                    else:
                        showMessage('There is no query to execute.')
            else:
                showMessage('Please select the query to execute.')
        
        endLoading()

## IDP-KG Query Interface

The following cell generates the UI and allows the selection of the KG and the query to run.

The results of the query execution are displayed below. The query box can be expanded to show the text of the query. The results box is scrollable.

Descriptions of the first 14 queries can be found in the [HCLS Dataset Description Community Profile](https://www.w3.org/TR/hcls-dataset/#s6_6). The other queries are explained [above](#IDP-KG-Analysis-Queries).

Note that when using any option other than the SPARQL Endpoint, the query is executed by the RDFlib package, which does not have a correct implementation of the `GROUP_CONCAT` operator.
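
For readers unfamiliar with the operator, `GROUP_CONCAT` joins the values bound within each group into a single string, using a single space as the default separator. The sketch below mimics that behaviour in plain Python on made-up data. It only illustrates the intended semantics; the `group_concat` helper is hypothetical and says nothing about how RDFlib or any endpoint implements the operator.

```python
from collections import defaultdict

# Hypothetical (protein, dataset) solutions, as a query might bind them.
solutions = [
    ("ex:P1", "DisProt"),
    ("ex:P1", "MobiDB"),
    ("ex:P2", "PED"),
]

def group_concat(solutions, separator=" "):
    """Join the second value of each pair, grouped by the first value."""
    groups = defaultdict(list)
    for key, value in solutions:
        groups[key].append(value)
    return {key: separator.join(values) for key, values in groups.items()}

merged = group_concat(solutions, separator=", ")
print(merged)
```

SPARQL does not fix the order in which values are concatenated, so results that differ only in that order are equivalent.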

In [None]:
createSelectionGUI()